463

Qwen3-30B-A3B-GGUF

--- base_model: Qwen/Qwen3-30B-A3B language: - en library_name: transformers license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE license: apache-2.0 tags: - qwen3 - qwen - unsloth - transformers --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

--- base_model: - openai/gpt-oss-120b license: apache-2.0 pipeline_tag: text-generation library_name: transformers tags: - openai - unsloth --- > [!NOTE] > The F16 quant is gpt-oss in its **original** precision. All GGUFs have our fixes. [Read our guide here.](https://docs.unsloth.ai/basics/gpt-oss) > See our collection for all versions of gpt-oss i

license:apache-2.0

151,125

--- base_model: Qwen/Qwen3-0.6B language: - en library_name: transformers license_link: https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE license: apache-2.0 tags: - qwen3 - qwen - unsloth - transformers --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn

license:apache-2.0

85,127

Qwen2.5-VL-7B-Instruct-GGUF

--- base_model: - Qwen/Qwen2.5-VL-7B-Instruct license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal - unsloth library_name: transformers ---

license:apache-2.0

82,509

Qwen2.5-7B-Instruct

license:apache-2.0

80,576

llama-3-8b-Instruct-bnb-4bit

tinyllama-chat-bnb-4bit

llama

72,210

meta-Llama-3.1-8B-unsloth-bnb-4bit

llama

68,758

Llama-3.2-1B-Instruct-unsloth-bnb-4bit

llama

68,504

gpt-oss-120b

Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of these open models:

- `gpt-oss-120b`: for production, general purpose, high reasoning use cases that fit into a single H100 GPU (117B parameters with 5.1B active parameters)
- `gpt-oss-20b`: for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained on our harmony response format and should only be used with the harmony format, as they will not work correctly otherwise.

> [!NOTE]
> This model card is dedicated to the larger `gpt-oss-120b` model. Check out `gpt-oss-20b` for the smaller model.

- **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk; ideal for experimentation, customization, and commercial deployment.
- **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- **Full chain-of-thought:** Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
- **Agentic capabilities:** Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- **Native MXFP4 quantization:** The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory.

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

Once set up, you can proceed to run the model by running the snippet below:

Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver:

Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after installing Ollama.

If you are using LM Studio you can use the following commands to download.

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub directly from Hugging Face CLI:

You can adjust the reasoning level that suits your task across three levels:

- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompts, e.g., "Reasoning: high".

The gpt-oss models are excellent for:

- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks

Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller `gpt-oss-20b` can even be fine-tuned on consumer hardware.
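
As a minimal sketch of the reasoning-level setting described above, the helper below builds a chat message list whose system prompt carries the `Reasoning: <level>` line. The helper name `build_messages` and the role/content dictionary layout (the shape commonly passed to chat templates) are assumptions for illustration; they are not part of the gpt-oss release.

```python
# Hypothetical helper for illustration; not part of gpt-oss or Transformers.
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_prompt: str, reasoning: str = "medium") -> list[dict[str, str]]:
    """Return chat messages with the reasoning level set in the system prompt."""
    level = reasoning.lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}, got {reasoning!r}")
    return [
        # The card gives "Reasoning: high" as an example system prompt line.
        {"role": "system", "content": f"Reasoning: {level}"},
        {"role": "user", "content": user_prompt},
    ]


if __name__ == "__main__":
    for message in build_messages("Summarize the harmony format.", reasoning="high"):
        print(message)
```

A list like this would then be handed to whatever applies the harmony format, for example a chat template, rather than being concatenated into a raw prompt by hand.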

license:apache-2.0

67,349

gemma-3-4b-it-GGUF

--- base_model: google/gemma-3-4b-it language: - en library_name: transformers license: gemma tags: - unsloth - transformers - gemma3 - gemma - google --- See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats.

GLM-4.6-GGUF

Read our How to Run GLM-4.6 Guide! > [!NOTE] > Please use latest version of `llama.cpp`. This GGUF includes multiple Unsloth chat template fixes! For `llama.cpp`, please use `--jinja` > Unsloth Dyn...

license:mit

63,537

126

mistral-7b-instruct-v0.3-bnb-4bit

--- language: - en library_name: transformers license: apache-2.0 tags: - unsloth - transformers - mistral - mistral-7b - mistral-instruct - instruct base_model: mistralai/Mistral-7B-Instruct-v0.3 ---

license:apache-2.0

59,133

Llama-3.2-3B-Instruct-unsloth-bnb-4bit

Mistral-Small-24B-Instruct-2501

license:apache-2.0

47,745

gemma-3-27b-it-GGUF

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Read our Guide to see how to Run Gemma 3 correctly.

- Fine-tune Gemma 3 (12B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| GRPO with Gemma 3 (12B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens and the 1B with 2 trillion tokens. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
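
The input limits listed above (256 tokens per image, 128K or 32K tokens of total input context) imply a simple budget calculation, sketched below. The function name is hypothetical, and treating "K" as 1024 tokens is an assumption; the card does not say whether 128K means 128,000 or 131,072.

```python
# Figures from the card; interpreting "K" as 1024 tokens is an assumption.
TOKENS_PER_IMAGE = 256
INPUT_CONTEXT_TOKENS = {
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}


def remaining_text_tokens(model_size: str, num_images: int) -> int:
    """Tokens left for text after reserving 256 tokens for each image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    budget = INPUT_CONTEXT_TOKENS[model_size] - num_images * TOKENS_PER_IMAGE
    if budget < 0:
        raise ValueError("images alone exceed the input context")
    return budget


if __name__ == "__main__":
    # Four images on the 27B model reserve 4 * 256 = 1024 tokens.
    print(remaining_text_tokens("27B", num_images=4))
```

The same arithmetic gives a rough upper bound on how many images fit alongside a prompt of known length.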
Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ----------------------------- |----------|:----:|:----:|:----:|:----:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ---------------------- |----------|:----:|:----:|:----:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:----:|:----:|:----:|:----:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ---------------------------- |:----:|:----:|:----:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

- Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably-sized open model alternatives.

[g3-tech-report]: https://goo.gle/Gemma3Report
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

Mistral-Small-3.1-24B-Instruct-2503

license:apache-2.0

47,663

gemma-3-270m-it

—

47,105

phi-4-unsloth-bnb-4bit

Qwen3-14B-unsloth-bnb-4bit

Mistral-Small-3.2-24B-Instruct-2506-GGUF

> [!NOTE]
> Includes our GGUF chat template fixes! Tool calling works as well! If you are using `llama.cpp`, use `--jinja` to enable the system prompt.

Unsloth Dynamic 2.0 achieves SOTA performance in model quantization.

- Temperature of: 0.15
- Set top_p to: 1.00
- Max tokens (context length): 128K

- Fine-tune Mistral v0.3 (7B) for free using our Google Colab notebook here!
- View the rest of our notebooks in our docs here.

Mistral-Small-3.2-24B-Instruct-2506 is a minor update of Mistral-Small-3.1-24B-Instruct-2503.

Small-3.2 improves in the following categories:

- Instruction following: Small-3.2 is better at following precise instructions
- Repetition errors: Small-3.2 produces fewer infinite generations or repetitive answers
- Function calling: Small-3.2's function calling template is more robust (see here and examples)

In all other categories Small-3.2 should match or slightly improve compared to Mistral-Small-3.1-24B-Instruct-2503.

Key Features: same as Mistral-Small-3.1-24B-Instruct-2503.

We compare Mistral-Small-3.2-24B to Mistral-Small-3.1-24B-Instruct-2503. For more comparison against other models of similar size, please check Mistral-Small-3.1's Benchmarks.

| Model | Wildbench v2 | Arena Hard v2 | IF (Internal; accuracy) |
|-------|--------------|---------------|-------------------------|
| Small 3.1 24B Instruct | 55.6% | 19.56% | 82.75% |
| Small 3.2 24B Instruct | 65.33% | 43.1% | 84.78% |

Small 3.2 reduces infinite generations by 2x on challenging, long and repetitive prompts.

| Model | Infinite Generations (Internal; Lower is better) |
|-------|-------|
| Small 3.1 24B Instruct | 2.11% |
| Small 3.2 24B Instruct | 1.29% |

| Model | MMLU | MMLU Pro (5-shot CoT) | MATH | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP Plus - Pass@5 | HumanEval Plus - Pass@5 | SimpleQA (TotalAcc) |
|-------|------|-----------------------|------|------------------------|---------------------------|--------------------|-------------------------|---------------------|
| Small 3.1 24B Instruct | 80.62% | 66.76% | 69.30% | 44.42% | 45.96% | 74.63% | 88.99% | 10.43% |
| Small 3.2 24B Instruct | 80.50% | 69.06% | 69.42% | 44.22% | 46.13% | 78.33% | 92.90% | 12.10% |

| Model | MMMU | Mathvista | ChartQA | DocVQA | AI2D |
|-------|------|-----------|---------|--------|------|
| Small 3.1 24B Instruct | 64.00% | 68.91% | 86.24% | 94.08% | 93.72% |
| Small 3.2 24B Instruct | 62.50% | 67.09% | 87.4% | 94.86% | 92.91% |

The model can be used with the following frameworks:

- `vllm (recommended)`: See here
- `transformers`: See here

Note 1: We recommend using a relatively low temperature, such as `temperature=0.15`.

Note 2: Make sure to add a system prompt to the model to best tailor it for your needs. If you want to use the model as a general assistant, we recommend using the one provided in the SYSTEM_PROMPT.txt file.

Doing so should automatically install `mistral_common >= 1.6.2`.

You can also make use of a ready-to-go docker image or on the docker hub.

We recommend that you use Mistral-Small-3.2-24B-Instruct-2506 in a server/client setting.

Note: Running Mistral-Small-3.2-24B-Instruct-2506 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.

2. To ping the client you can use a simple Python snippet. See the following examples.

Leverage the vision capabilities of Mistral-Small-3.2-24B-Instruct-2506 to make the best choice given a scenario, go catch them all!
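
The ~55 GB figure quoted above for bf16 or fp16 can be sanity-checked with back-of-the-envelope arithmetic, sketched below. It assumes roughly 24 billion parameters (inferred from the model name, not stated exactly here) and 2 bytes per parameter for 16-bit formats; the function name is hypothetical, and the remainder is only a rough allowance for KV cache, activations and runtime overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Assumption: ~24e9 parameters, taken from the "24B" in the model name.
    weights_gb = weight_memory_gb(24e9, bytes_per_param=2)
    quoted_total_gb = 55.0  # figure quoted in the card
    print(f"weights: ~{weights_gb:.0f} GB")
    print(f"left for cache and overhead: ~{quoted_total_gb - weights_gb:.0f} GB")
```

Under these assumptions the weights alone account for most of the quoted requirement, which is why lower-precision quantizations reduce the footprint so sharply.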
Mistral-Small-3.2-24B-Instruct-2506 is excellent at function / tool calling tasks via vLLM. E.g.:

Mistral-Small-3.2-24B-Instruct-2506 will follow your instructions down to the last letter!

You can also use Mistral-Small-3.2-24B-Instruct-2506 with `Transformers`! To make the best use of our model with `Transformers` make sure to have installed `mistral-common >= 1.6.2` to use our tokenizer.

Then load our tokenizer along with the model and generate:
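
For the server/client setup recommended earlier in this card, a request body carrying the suggested sampling settings (temperature 0.15, top_p 1.0, plus a system prompt) might look like the sketch below. The function name is hypothetical, the `mistralai/` prefix on the model id is an assumption, and the field names follow the common OpenAI-compatible chat completions layout rather than anything stated in this card.

```python
import json

# Assumption: the served model id uses the "mistralai/" organization prefix.
MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"


def build_chat_request(user_prompt: str, system_prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat completions payload with the card's recommended sampling."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.15,  # low temperature, as recommended in Note 1
        "top_p": 1.0,
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    payload = build_chat_request("Hello!", "You are a helpful assistant.")
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and posted to whichever OpenAI-compatible endpoint the server exposes.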

Qwen3-4B-Instruct-2507

license:apache-2.0

33,621

Devstral-Small-2507-GGUF

> [!NOTE] > You should use `--jinja` to enable the system prompt in `llama.cpp`. Devstral 1.1, with tool-calling and optional vision support. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. - Fine-tune Mistral v0.3 (7B) for free using our Google Colab notebook here-Conversational.ipynb)! - Read our Blog about Devstral 1.1 support: docs.unsloth.ai/basics/devstral - View the rest of our notebooks in our docs here. Devstral is an agentic LLM for software engineering tasks built under a collaboration between Mistral AI and All Hands AI 🙌. Devstral excels at using tools to explore codebases, editing multiple files and power software engineering agents. The model achieves remarkable performance on SWE-bench which positionates it as the #1 open source model on this benchmark. It is finetuned from Mistral-Small-3.1, therefore it has a long context window of up to 128k tokens. As a coding agent, Devstral is text-only and before fine-tuning from `Mistral-Small-3.1` the vision encoder was removed. For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community. Updates compared to `Devstral Small 1.0`: - Improved performance, please refer to the benchmark results. - `Devstral Small 1.1` is still great when paired with OpenHands. This new version also generalizes better to other prompts and coding environments. - Supports Mistral's function calling format. Key Features: - Agentic coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents. - lightweight: with its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an appropriate model for local deployment and on-device use. 
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes. - Context Window: A 128k context window. - Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size. Devstral Small 1.1 achieves a score of 53.6% on SWE-Bench Verified, outperforming Devstral Small 1.0 by +6,8% and the second best state of the art model by +11.4%. | Model | Agentic Scaffold | SWE-Bench Verified (%) | |--------------------|--------------------|------------------------| | Devstral Small 1.1 | OpenHands Scaffold | 53.6 | | Devstral Small 1.0 | OpenHands Scaffold | 46.8 | | GPT-4.1-mini | OpenAI Scaffold | 23.6 | | Claude 3.5 Haiku | Anthropic Scaffold | 40.6 | | SWE-smith-LM 32B | SWE-agent Scaffold | 40.2 | | Skywork SWE | OpenHands Scaffold | 38.0 | | DeepSWE | R2E-Gym Scaffold | 42.2 | When evaluated under the same test scaffold (OpenHands, provided by All Hands AI 🙌), Devstral exceeds far larger models such as Deepseek-V3-0324 and Qwen3 232B-A22B. We recommend to use Devstral with the OpenHands scaffold. You can use it either through our API or by running locally. API Follow these instructions to create a Mistral account and get an API key. Then run these commands to start the OpenHands docker container. The model can also be deployed with the following libraries: - `vllm (recommended)`: See here - `mistral-inference`: See here - `transformers`: See here - `LMStudio`: See here - `llama.cpp`: See here - `ollama`: See here Expand = 0.9.1`](https://github.com/vllm-project/vllm/releases/tag/v0.9.1): Also make sure to have installed `mistralcommon >= 1.7.0`. You can also make use of a ready-to-go docker image or on the docker hub. We recommand that you use Devstral in a server/client setting. 2. To ping the client you can use a simple Python snippet. Then load our tokenizer along with the model and generate: Make sure you launched an OpenAI-compatible server such as vLLM or Ollama as described above. 
Then, you can use OpenHands to interact with `Devstral Small 1.1`. In the case of the tutorial we spun up a vLLM server running the command:

The server address should be in the following format: `http:// :8000/v1`

The easiest way to launch OpenHands is to use the Docker image:

Then, you can access the OpenHands UI at `http://localhost:3000`. When accessing the OpenHands UI, you will be prompted to connect to a server. You can use the advanced mode to connect to the server you launched earlier.

Fill in the following fields:
- Custom Model: `openai/mistralai/Devstral-Small-2507`
- Base URL: `http:// :8000/v1`
- API Key: `token` (or any other token you used to launch the server if any)

The same server setup applies to Cline: make sure an OpenAI-compatible server such as vLLM or Ollama is running as described above, with the server address in the same `http:// :8000/v1` format. You can follow installation of Cline here. Then you can configure the server address in the settings.

OpenHands: Understanding Test Coverage of Mistral Common

We can start the OpenHands scaffold and link it to a repo to analyze test coverage and identify badly covered files. Here we start with our public `mistral-common` repo.

After the repo is mounted in the workspace, we give the following instruction:

The agent will first browse the code base to check test configuration and structure.

Then it sets up the testing dependencies and launches the coverage test:

Finally, the agent writes the necessary code to visualize the coverage, export the results and save the plots to a png.

At the end of the run, the following plots are produced:

First initialize Cline inside VSCode and connect it to the server you launched earlier.

We give the following instruction to build the video game:

Don't hesitate to iterate or give more information to Devstral to improve the game!
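The server/client setup described above can be exercised with a few lines of standard-library Python. This is only a minimal sketch: the host `http://localhost:8000/v1`, the API key `token`, and the helper names `build_chat_request` and `ask` are assumptions for illustration, and the model name is assumed to match the one the server was launched with.

```python
import json
import urllib.request

# Assumed values: adjust them to the server you actually launched.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "token"
MODEL = "mistralai/Devstral-Small-2507"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Write a function that reverses a string.")` requires the server to be running; `build_chat_request` can be inspected offline to check the URL and payload first.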

license:apache-2.0

33,120

Qwen3-14B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-14B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 14.8B
- Number of Parameters (Non-Embedding): 13.2B
- Number of Layers: 40
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With an outdated `transformers` release, loading Qwen3 fails because the architecture is not recognized.

```python
# parse the thinking content out of the generated token ids (output_ids)
try:
    # rindex: find the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

```shell
vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser deepseek-r1
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-14B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens, not the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-14B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #         # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #         # Do not add: When the response has been separated by reasoning_content and content.
    #         'thought_in_content': True,
    #     },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
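The YaRN settings quoted in this section follow a simple ratio: 131,072 / 32,768 = 4.0. The sketch below derives the scaling fragment for an arbitrary target length. It assumes the factor is just the target length divided by the native length, and the helper name `yarn_rope_scaling` is hypothetical, not part of any library.

```python
import json

NATIVE_CONTEXT = 32768  # native context length stated for Qwen3-14B


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Return a rope-scaling dict that stretches native_context to target_context."""
    if target_context <= native_context:
        raise ValueError("YaRN is only needed when the target exceeds the native context")
    return {
        "type": "yarn",
        # Assumption: the scaling factor is the plain ratio of the two lengths.
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


# JSON fragment suitable for a config file or a command-line override.
print(json.dumps(yarn_rope_scaling(131072)))
```

Under the same assumption, a deployment that only needs 65,536 tokens would use a factor of 2.0 rather than 4.0.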


license:apache-2.0

32,939

Kimi-K2-Instruct-GGUF

Learn how to run Kimi-K2 Dynamic GGUFs - Read our Guide! Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- You can now use the latest update of llama.cpp to run the model.
- For complete detailed instructions, see our guide: docs.unsloth.ai/basics/kimi-k2
- It is recommended to have at least 128GB of unified RAM to run the small quants. With 16GB VRAM and 256GB RAM, expect 5+ tokens/sec.
- For best results, use any 2-bit XL quant or above.
- Set the temperature to 0.6 (recommended) to reduce repetition and incoherence.

📰 Tech Blog | 📄 Paper Link (coming soon)

2025.7.15
- We have updated our tokenizer implementation. Now special tokens like `[EOS]` can be encoded to their token ids.
- We fixed a bug in the chat template that was breaking multi-turn tool calls.

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

Key Features
- Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
- MuonClip Optimizer: We apply the Muon optimizer at an unprecedented scale, and develop novel optimization techniques to resolve instabilities while scaling up.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

Model Variants
- Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
- Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
| | |
|:---:|:---:|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

| Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |
|---|---|---|---|---|---|---|---|---|
| LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified (Agentless Coding) | Single Patch w/o Test (Acc) | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7 | 72.5 | 54.6 | — |
| SWE-bench Verified (Agentic Coding) | Multiple Attempts (Acc) | 71.6 | — | — | 80.2 | 79.4 | — | — |
| SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | — |
| TerminalBench | Inhouse Framework (Acc) | 30.0 | — | — | 35.5 | 43.2 | 8.3 | — |
| TerminalBench | Terminus (Acc) | 25.0 | 16.3 | 6.6 | — | — | 30.3 | 16.8 |
| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |
| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
| AIME 2024 | Avg@64 | 69.6 | 59.4 | 40.1 | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7 | 33.1 | 33.9 | 37.0 | 46.6 |
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
| GPQA-Diamond | Avg@8 | 75.1 | 68.4 | 62.9 | 70.0 | 74.9 | 66.3 | 68.2 |
| Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2 | 87.6 | 87.4 | 88.0 | 84.3 |
| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
| Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |

- Bold denotes global SOTA, and underlined denotes open-source SOTA.
- Data points marked with * are taken directly from the model's tech report or blog.
- All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
- Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves a 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.
- To ensure the stability of the evaluation, we employed avg@k on the AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, Tau2.
- Some data points have been omitted due to prohibitively expensive evaluation costs.

| Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
|---|---|---|---|---|---|---|

- We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
- All models are evaluated using the same evaluation protocol.

4. Deployment

> [!Note]
> You can access Kimi K2's API on https://platform.moonshot.ai , where we provide an OpenAI/Anthropic-compatible API.
>
> The Anthropic-compatible API maps temperature by `real_temperature = request_temperature * 0.6` for better compatibility with existing applications.

Our model checkpoints are stored in the block-fp8 format; you can find them on Huggingface.

Currently, Kimi-K2 is recommended to run on the following inference engines:

Deployment examples for vLLM and SGLang can be found in the Model Deployment Guide.
Once the local inference service is up, you can interact with it through the chat endpoint:

> [!NOTE]
> The recommended temperature for Kimi-K2-Instruct is `temperature = 0.6`.
> If no special instructions are required, the system prompt above is a good default.

Kimi-K2-Instruct has strong tool-calling capabilities. To enable them, you need to pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.

The following example demonstrates calling a weather tool end-to-end:

The `tool_call_with_client` function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic. For streaming output and manual tool-parsing, see the Tool Calling Guide.

Both the code repository and the model weights are released under the Modified MIT License.

If you have any questions, please reach out at [email protected].
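The tool-calling flow described above (pass the tool list, let the model pick a tool, execute it, return the result) can be sketched without a running server. This is an illustrative sketch only: the `get_weather` stub, the `TOOLS` schema, and the `run_tool_call` helper are assumptions modelled on the common OpenAI-style function-calling format, not the card's own `tool_call_with_client` implementation.

```python
import json


def get_weather(city: str) -> dict:
    """Stand-in tool: a real implementation would query a weather service."""
    return {"city": city, "weather": "Sunny"}


# Tool list sent with every request (OpenAI-style function schema, assumed here).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information for a city.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {"city": {"type": "string", "description": "Name of the city"}},
        },
    },
}]

# Maps tool names the model may emit to local Python callables.
TOOL_MAP = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the `tool` message to send back to the model."""
    name = tool_call["function"]["name"]
    arguments = json.loads(tool_call["function"]["arguments"])
    result = TOOL_MAP[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "name": name,
        "content": json.dumps(result),
    }
```

In a full loop, each returned `tool` message is appended to the conversation and the request is repeated until the model stops asking for tools.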

—

32,597

209

gemma-3-27b-it


—

32,461

Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. We’re releasing two flavors of these open models: - `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single H100 GPU (117B parameters with 5.1B active parameters) - `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters) Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise. > [!NOTE] > This model card is dedicated to the smaller `gpt-oss-20b` model. Check out `gpt-oss-120b` for the larger model. Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment. Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs. Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users. Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning. Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs. Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory. You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package. 
To get started, install the necessary dependencies to set up your environment:

Once set up, you can proceed to run the model by running the snippet below:

Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver:

Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after installing Ollama.

If you are using LM Studio, you can use the following commands to download.

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub directly with the Hugging Face CLI:

You can adjust the reasoning level to suit your task across three levels:
- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".

The gpt-oss models are excellent for:
- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks

Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger `gpt-oss-120b` can be fine-tuned on a single H100 node.
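Setting the reasoning level through the system prompt, as described above, amounts to prepending one system message. The sketch below builds such a message list; the helper name `build_messages` and the validation are illustrative assumptions. The resulting list is meant to be passed through a chat template, which applies the harmony format.

```python
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_prompt: str, reasoning: str = "medium") -> list:
    """Return chat messages whose system prompt selects the reasoning level."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}, got {reasoning!r}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Explain quantum mechanics clearly and concisely.", reasoning="high")
print(messages[0]["content"])
```

Lower levels trade depth for latency, so `low` is a reasonable default for general dialogue and `high` for harder analysis.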

Qwen2.5-14B-Instruct-unsloth-bnb-4bit


license:apache-2.0

27,667

phi-4

llama

27,665

Magistral-Small-2509-GGUF

Learn to run Magistral 1.2 correctly - Read our Guide. Unsloth Dynamic 2.0 achieves SOTA performance in model quantization.

Read our in-depth guide about Magistral 1.2: docs.unsloth.ai/basics/magistral

- Fine-tune Magistral 1.2 for free using our Kaggle notebook here!
- View the rest of our notebooks in our docs here.

Building upon Mistral Small 3.2 (2506), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters.

Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

- Multimodality: The model now has a vision encoder and can take multimodal inputs, extending its reasoning capabilities to vision.
- Performance upgrade: Magistral Small 1.2 should give you significantly better performance than Magistral Small 1.1, as seen in the benchmark results.
- Better tone and persona: You should experience better LaTeX and Markdown formatting, and shorter answers on easy general prompts.
- Finite generation: The model is less likely to enter infinite generation loops.
- Special think tokens: [THINK] and [/THINK] special tokens encapsulate the reasoning content in a thinking chunk. This makes it easier to parse the reasoning trace and prevents confusion when the '[THINK]' token is given as a string in the prompt.
- Reasoning prompt: The reasoning prompt is given in the system prompt.

- Reasoning: Capable of long chains of reasoning traces before providing an answer.
- Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Vision: Vision capabilities enable the model to analyze images and reason based on visual content in addition to text.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window. Performance might degrade past 40k, but Magistral should still give good results. Hence we recommend leaving the maximum model length at 128k and only lowering it if you encounter low performance.

| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | Livecodebench (v5) |
|--------------------------|---------------|---------------|--------------|--------------------|
| Magistral Medium 1.2 | 91.82% | 83.48% | 76.26% | 75.00% |
| Magistral Medium 1.1 | 72.03% | 60.99% | 71.46% | 59.35% |
| Magistral Medium 1.0 | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small 1.2 | 86.14% | 77.34% | 70.07% | 70.88% |
| Magistral Small 1.1 | 70.52% | 62.03% | 65.78% | 59.17% |
| Magistral Small 1.0 | 70.68% | 62.76% | 68.18% | 55.84% |

Please make sure to use:
- `top_p`: 0.95
- `temperature`: 0.7
- `max_tokens`: 131072

We highly recommend including the following system prompt for the best results; you can edit and customise it if needed for your specific use case.

The `[THINK]` and `[/THINK]` are special tokens that must be encoded as such.

Please make sure to use mistral-common as the source of truth. Find below examples from libraries supporting `mistral-common`.

We invite you to choose, depending on your use case and requirements, between keeping reasoning traces during multi-turn interactions or keeping only the final assistant response.

Make sure you install the latest `Transformers` version.
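The recommended sampling settings and the choice between keeping or dropping reasoning traces can be illustrated with a small helper. This is a sketch under a loud assumption: it treats `[THINK]` and `[/THINK]` as literal text in an already-decoded response, which may not hold for every serving stack, and `mistral-common` remains the source of truth for real tokenization. The names `SAMPLING_PARAMS` and `split_reasoning` are hypothetical.

```python
import re

# Sampling settings recommended in the card.
SAMPLING_PARAMS = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 131072}

# Assumption: the decoded response contains the think markers as plain text.
_THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no think chunk is found."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("[THINK]2 + 2 = 4[/THINK]The answer is 4.")
print(reasoning)
print(answer)
```

To keep only the final assistant response across turns, store `answer` in the conversation history; to keep traces, store the full text instead.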

license:apache-2.0

27,283

Llama-3.2-3B-Instruct-GGUF


llama

26,526

DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit


llama

26,327

Llama-3.2-1B-Instruct-GGUF

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats.

16bit, 8bit, 6bit, 5bit, 4bit, 3bit and 2bit uploads available.

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

unsloth/Llama-3.2-1B-Instruct

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).
The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.


llama

26,152

gemma-3-1b-it-unsloth-bnb-4bit


—

25,755

gemma-3-270m-it-GGUF

—

25,431

128

DeepSeek-R1-GGUF

See our collection for versions of Deepseek-R1 including GGUF & 4-bit formats.

Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit.

Or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic

1. Do not forget about `<｜User｜>` and `<｜Assistant｜>` tokens! - Or use a chat template formatter
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
3. It's best to use `--min-p 0.05` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
4. Download the model via:
5. Example with Q4_0 K quantized cache. Notice `-no-cnv` disables auto conversation mode.
6. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
7. If you want to merge the weights together, use this script:

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| -------- | -------- | ------------ | ------------ | ---------------------| ---------- |
| 1.58bit | UD-IQ1_S | 131GB | Fair | Link | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | Good | Link | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |

Finetune your own Reasoning model like R1 with Unsloth!

We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-GRPO.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
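How many layers to offload to a GPU can be roughly estimated from the quant's disk size. The sketch below is a back-of-the-envelope heuristic, not a documented rule from this card: it assumes layers are about equal in size, assumes DeepSeek-R1 has 61 layers, and reserves a hypothetical margin of 4 layers for the KV cache and other overhead. The helper name `estimate_gpu_layers` is illustrative.

```python
import math


def estimate_gpu_layers(vram_gb: float, model_size_gb: float,
                        n_layers: int = 61, reserve_layers: int = 4) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    # Share of the model that fits in VRAM, expressed in whole layers.
    proportional = math.floor(vram_gb / model_size_gb * n_layers)
    # Keep a safety margin and clamp to the valid range.
    return max(0, min(n_layers, proportional - reserve_layers))


# 24GB GPU with the 131GB 1.58-bit quant from the table above.
print(estimate_gpu_layers(24, 131))  # prints 7
```

The result is a starting value for llama.cpp's `--n-gpu-layers` option; lower it if the process runs out of memory.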
| Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | GRPO with Phi-4 (14B) | ▶️ Start on Colab-GRPO.ipynb) | 2x faster | 80% less | | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2 VL (7B) | ▶️ Start on Colab-Vision.ipynb) | 1.8x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | | Llama-3.1 (8B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab-Conversational.ipynb) | 2.2x faster | 62% less | - This Llama 3.2 conversational notebook-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook-TextCompletion.ipynb) is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the DeepSeek team for creating and releasing these models. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. 
DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section. Post-Training: Large-Scale Reinforcement Learning on the Base Model - We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. 
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on the Qwen2.5 and Llama3 series to the community.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace |

DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to the DeepSeek-V3 repository.

| Model | Base Model | Download |
| :------------: | :------------: | :------------: |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace |

DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models.

DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
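The sampling-based metrics named in the evaluation setup can be illustrated with a small sketch. It assumes pass@1 is estimated as the fraction of sampled responses that are correct, and that the cons@64 column reported in the distilled-model results is a majority vote over the sampled final answers. The function names and the toy data (8 samples rather than 64) are illustrative and not taken from the DeepSeek evaluation code.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(correct_flags: Sequence[bool]) -> float:
    """Estimate pass@1 as the fraction of sampled responses that are correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


def cons_at_k(answers: Sequence[str], reference: str) -> bool:
    """Return True when the most frequent sampled answer matches the reference."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer == reference


# Toy data: 8 sampled final answers for one query (the card samples 64).
sampled_answers = ["42", "42", "41", "42", "7", "42", "42", "41"]
reference_answer = "42"

flags = [answer == reference_answer for answer in sampled_answers]
print(pass_at_1(flags))                              # 5 of 8 correct -> 0.625
print(cons_at_k(sampled_answers, reference_answer))  # majority answer is "42" -> True
```

Averaging over many samples lowers the variance of the pass@1 estimate at a non-zero temperature, which is presumably why the setup draws 64 responses per query rather than one.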
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 | |----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------| | | Architecture | - | - | MoE | - | - | MoE | | | # Activated Params | - | - | 37B | - | - | 37B | | | # Total Params | - | - | 671B | - | - | 671B | | English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 | | | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 | | | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 | | | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 | | | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 | | | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 | | | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 | | | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 | | | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 | | | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 | | Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 | | | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 | | | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 | | | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 | | | Aider-Polyglot (Acc.) 
| 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 | | Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 | | | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 | | | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 | | Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 | | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 | | | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 | | Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating | |------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------| | GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 | | Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 | | o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 | | QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 | | DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 | | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 | | DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 | | DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 | | DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 | 5. Chat Website & API Platform You can chat with DeepSeek-R1 on DeepSeek's official website: chat.deepseek.com, and switch on the button "DeepThink" We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com Please visit DeepSeek-V3 repo for more information about running DeepSeek-R1 locally. DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models. 
For instance, you can easily start a service using vLLM.

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

7. License

This code repository and the model weights are licensed under the MIT License. The DeepSeek-R1 series supports commercial use and allows any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:

- DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from the Qwen-2.5 series, which is originally licensed under the Apache 2.0 License, and are now finetuned with 800k samples curated with DeepSeek-R1.
- DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under the llama3.1 license.
- DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under the llama3.3 license.

9. Contact

If you have any questions, please raise an issue or contact us at [email protected].
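The usage recommendations in this card (temperature in the 0.5-0.7 range, no system prompt, a step-by-step directive for math, and averaging over several runs) can be sketched as a request builder. This is a minimal illustration: the payload uses the field names of an OpenAI-compatible chat completions request, and `build_request`, `average_score`, and the example scores are hypothetical helpers and values, not part of any DeepSeek tooling.

```python
from statistics import mean
from typing import Dict, Sequence

MATH_DIRECTIVE = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


def build_request(question: str, model: str, is_math: bool = False) -> Dict:
    """Build a chat-completions style payload that follows the recommendations."""
    prompt = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return {
        "model": model,
        # No system message: every instruction lives in the user turn.
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,   # recommended value inside the 0.5-0.7 range
        "top_p": 0.95,        # top-p value used in the card's evaluation setup
        "max_tokens": 32768,  # maximum generation length used in the evaluation
    }


def average_score(run_scores: Sequence[float]) -> float:
    """Average the scores of repeated evaluation runs."""
    return mean(run_scores)


payload = build_request(
    "What is 17 * 24?", "DeepSeek-R1-Distill-Qwen-32B", is_math=True
)
print(payload["messages"][0]["content"])
print(average_score([0.70, 0.74, 0.72]))  # made-up scores from three runs
```

The resulting dictionary could be sent to any OpenAI-compatible endpoint, such as a locally started vLLM service; the transport layer is left out here on purpose.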

JanusCoder-8B-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

💻Github Repo • 🤗Model Collections • 📜Technical Report

We introduce JanusCoder and JanusCoderV, a suite of open-source foundational models designed to establish a unified visual-programmatic interface for code intelligence. This model suite is built upon open-source language models (such as Qwen3-8B and 14B) and multimodal models (such as Qwen2.5-VL and InternVL3.5-8B). The JanusCoder series is trained on JANUSCODE-800K, the largest multimodal code corpus to date, generated by an innovative synthesis toolkit and covering everything from standard charts to complex interactive Web UIs and code-driven animations. This enables the models to uniformly handle diverse visual-programmatic tasks, such as generating code from textual instructions, visual inputs, or a combination of both, rather than building specialized models for isolated tasks. JanusCoder excels at flexible content generation (like data visualizations and interactive front-ends) as well as precise, program-driven editing of visual effects and complex animation construction.

| Model Name | Description | Download |
| --- | --- | --- |
| 👉 JanusCoder-8B | 8B text model based on Qwen3-8B. | 🤗 Model |
| JanusCoder-14B | 14B text model based on Qwen3-14B. | 🤗 Model |
| JanusCoderV-7B | 7B multimodal model based on Qwen2.5-VL-7B. | 🤗 Model |
| JanusCoderV-8B | 8B multimodal model based on InternVL3.5-8B. | 🤗 Model |

We evaluate the JanusCoder model on various benchmarks that span code intelligence tasks across multiple programming languages:

| Model | JanusCoder-8B | Qwen3-8B | Qwen2.5-Coder-7B-Instruct | LLaMA3-8B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| PandasPlotBench (Task) | 80 | 74 | 76 | 69 | 85 |
| ArtifactsBench | 39.6 | 36.5 | 26.0 | 36.5 | 37.9 |
| DTVBench (Manim) | 9.70 | 6.20 | 8.56 | 4.92 | 10.60 |
| DTVBench (Wolfram) | 6.07 | 5.18 | 4.04 | 3.15 | 5.97 |

The following provides demo code illustrating how to generate text using JanusCoder-8B.

> Please use transformers >= 4.55.0 to ensure the model works normally.

Citation

🫶 If you are interested in our work or find the repository / checkpoints / benchmark / data helpful, please consider using the following citation format when referencing our papers:


license:apache-2.0

23,890

Qwen3-Coder-30B-A3B-Instruct-1M-GGUF

> [!NOTE]
> Extends context length from 256K to 1 million

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

Learn to run Qwen3-Coder correctly - Read our Guide.

See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements:

- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function call format.
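The long-context claim above (256K tokens natively, up to 1M with YaRN) implies a position scaling factor that can be worked out directly. The sketch below is illustrative arithmetic only: reading "1M" as 2**20 tokens is an assumption, and the `rope_scaling` key names follow the convention commonly used in Transformers configs rather than any configuration confirmed to ship with this repository.

```python
NATIVE_CONTEXT = 262_144    # 256K tokens, the native context length
TARGET_CONTEXT = 1_048_576  # assumption: "1M" interpreted as 2**20 tokens


def yarn_factor(native: int, target: int) -> float:
    """Scaling factor needed to stretch `native` positions to cover `target`."""
    if native <= 0 or target < native:
        raise ValueError("target must be at least the (positive) native length")
    return target / native


factor = yarn_factor(NATIVE_CONTEXT, TARGET_CONTEXT)

# Hypothetical config fragment; the key names are an assumption.
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(rope_scaling)
```

Under these assumptions the factor comes out to exactly 4, since 1,048,576 is four times 262,144.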
Qwen3-Coder-30B-A3B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

We advise you to use the latest version of `transformers`.

```python
from openai import OpenAI

# Define Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
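Returning to the tool-calling example in this card: once the model answers with a tool call, the application still has to execute it. The sketch below shows one possible way to dispatch such a call locally. `TOOL_REGISTRY`, `run_tool_call`, and the hand-written call are hypothetical; no live model response is involved, and the message layout mirrors the `tool` role used by OpenAI-compatible chat APIs.

```python
import json
from typing import Any, Callable, Dict


def square_the_number(input_num: float) -> float:
    """Local implementation of the tool declared in the schema."""
    return input_num ** 2


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "square_the_number": square_the_number,
}


def run_tool_call(
    name: str, arguments_json: str, tool_call_id: str = "call_0"
) -> Dict[str, str]:
    """Execute one tool call and wrap the result as a `tool` role message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    arguments = json.loads(arguments_json)  # tool arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }


# Hand-written call standing in for a model response.
tool_message = run_tool_call("square_the_number", '{"input_num": 1024}')
print(tool_message["content"])  # 1048576
```

In a real exchange the name, arguments, and call id would come from the model's response, and the returned message would be appended to the conversation before requesting the final answer.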

Qwen3-4B-Base

Qwen2.5-VL-7B-Instruct-bnb-4bit

Qwen3-VL-8B-Instruct

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.

The code for Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build it from source.

Here we show a code snippet to show how to use the chat model with `transformers`:

If you find our work helpful, feel free to cite us.


license:apache-2.0

15,530

Qwen2.5-VL-7B-Instruct

llama-3-8b-Instruct

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth!

We have a Google Colab Tesla T4 notebook for Llama-3 8b here: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Llama-3 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.


llama

13,474

Phi-4-mini-instruct-GGUF

This is Phi-4-mini-instruct with our BUG FIXES.

See our collection for versions of Phi-4 with our bug fixes including GGUF & 4-bit formats.

Unsloth's Phi-4 Dynamic Quants are selectively quantized, greatly improving accuracy over standard 4-bit.

Finetune your own Reasoning model like R1 with Unsloth! We have a free Google Colab notebook for turning Phi-4 into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi4(14B)-GRPO.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| GRPO with Phi-4 | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Unsloth bug fixes:

1. Padding and EOS tokens are the same - fixed this.
2. Chat template had extra EOS token - removed this. Otherwise you will see `<|endoftext|>` during inference.
3. EOS token should be `<|end|>` not `<|endoftext|>`. Otherwise it'll terminate at `<|endoftext|>`.
4. Changed `unk_token` to � from EOS.

Phi-4-mini-instruct is a lightweight open model built upon synthetic data and filtered publicly available websites, with a focus on high-quality, reasoning-dense data. The model belongs to the Phi-4 model family and supports 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning and direct preference optimization to support precise instruction adherence and robust safety measures.

📰 Phi-4-mini Microsoft Blog
📖 Phi-4-mini Technical Report
👩‍🍳 Phi Cookbook
🏡 Phi Portal
🖥️ Try It Azure, Huggingface

Phi-4: [mini-instruct | onnx]; multimodal-instruct;

The model is intended for broad multilingual commercial and research use. It is suited to general purpose AI systems and applications which require:

1) Memory/compute constrained environments
2) Latency bound scenarios
3) Strong reasoning (especially math and logic).

The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models, as well as performance differences across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
This release of Phi-4-mini-instruct is based on valuable user feedback from the Phi-3 series. The Phi-4-mini model employed new architecture for efficiency, larger vocabulary for multilingual support, and better post-training techniques were used for instruction following, function calling, as well as additional data leading to substantial gains on key capabilities. It is anticipated that most use cases will benefit from this release, but users are encouraged to test in their particular AI applications. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4-mini-instruct is welcomed and crucial to the model’s evolution and improvement. To understand the capabilities, the 3.8B parameters Phi-4-mini-instruct model was compared with a set of models over a variety of benchmarks using an internal benchmark platform (See Appendix A for benchmark methodology). A high-level overview of the model quality is as follows: | Benchmark | Similar size | | | | |2x size | | | | | | |----------------------------------|-------------|-------------------|-------------------|-------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------| | | Phi-4 mini-Ins | Phi-3.5-mini-Ins | Llama-3.2-3B-Ins | Mistral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Mistral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma2-9B-Ins | GPT-4o-mini-2024-07-18 | | Popular aggregated benchmark | | | | | | | | | | | | | Arena Hard | 32.8 | 34.4 | 17.0 | 26.9 | 32.0 | 55.5 | 37.3 | 25.7 | 42.7 | 43.7 | 53.7 | | BigBench Hard (0-shot, CoT) | 70.4 | 63.1 | 55.4 | 51.2 | 56.2 | 72.4 | 53.3 | 63.4 | 55.5 | 65.7 | 80.4 | | MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 60.8 | 65.0 | 72.6 | 63.0 | 68.1 | 65.0 | 71.3 | 77.2 | | MMLU-Pro (0-shot, CoT) | 52.8 | 47.4 | 39.2 | 35.3 | 44.7 | 56.2 | 36.6 | 44.0 | 40.9 | 50.1 | 62.8 | | Reasoning | | | | | | | | | | | | | ARC Challenge (10-shot) | 83.7 | 84.6 | 76.1 | 
80.3 | 82.6 | 90.1 | 82.7 | 83.1 | 79.4 | 89.8 | 93.5 | | BoolQ (2-shot) | 81.2 | 77.7 | 71.4 | 79.4 | 65.4 | 80.0 | 80.5 | 82.8 | 79.3 | 85.7 | 88.7 | | GPQA (0-shot, CoT) | 25.2 | 26.6 | 24.3 | 24.4 | 23.4 | 30.6 | 26.3 | 26.3 | 29.9 | 39.1 | 41.1 | | HellaSwag (5-shot) | 69.1 | 72.2 | 77.2 | 74.6 | 74.6 | 80.0 | 73.5 | 72.8 | 80.9 | 87.1 | 88.7 | | OpenBookQA (10-shot) | 79.2 | 81.2 | 72.6 | 79.8 | 79.3 | 82.6 | 80.2 | 84.8 | 79.8 | 90.0 | 90.0 | | PIQA (5-shot) | 77.6 | 78.2 | 68.2 | 73.2 | 72.6 | 76.2 | 81.2 | 83.2 | 78.3 | 83.7 | 88.7 | | Social IQA (5-shot) | 72.5 | 75.1 | 68.3 | 73.9 | 75.3 | 75.3 | 77.6 | 71.8 | 73.4 | 74.7 | 82.9 | | TruthfulQA (MC2) (10-shot) | 66.4 | 65.2 | 59.2 | 62.9 | 64.3 | 69.4 | 63.0 | 69.2 | 64.1 | 76.6 | 78.2 | | Winogrande (5-shot) | 67.0 | 72.2 | 53.2 | 59.8 | 63.3 | 71.1 | 63.1 | 64.7 | 65.4 | 74.0 | 76.9 | | Multilingual | | | | | | | | | | | | | Multilingual MMLU (5-shot) | 49.3 | 51.8 | 48.1 | 46.4 | 55.9 | 64.4 | 53.7 | 56.2 | 54.5 | 63.8 | 72.9 | | MGSM (0-shot, CoT) | 63.9 | 49.6 | 44.6 | 44.6 | 53.5 | 64.5 | 56.7 | 56.7 | 58.6 | 75.1 | 81.7 | | Math | | | | | | | | | | | | | GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 80.1 | 80.6 | 88.7 | 81.9 | 82.4 | 84.3 | 84.9 | 91.3 | | MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 41.8 | 61.7 | 60.4 | 41.6 | 47.6 | 46.1 | 51.3 | 70.2 | | Overall | 63.5 | 60.5 | 56.2 | 56.9 | 60.1 | 67.9 | 60.2 | 62.3 | 60.9 | 65.0 | 75.5 | Overall, the model with only 3.8B-param achieves a similar level of multilingual language understanding and reasoning ability as much larger models. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much factual knowledge, therefore, users may experience factual incorrectness. However, it may be possible to resolve such weakness by augmenting Phi-4 with a search engine, particularly when using the model under RAG settings. 
Phi-4-mini-instruct supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

Given the nature of the training data, the Phi-4-mini-instruct model is best suited for prompts using specific formats. There are two primary formats.

The chat format is used for general conversation and instructions.

The tool-enabled function-calling format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by `<|tool|>` and `<|/tool|>` tokens. The tools should be specified in JSON format, using a JSON dump structure. Example:

`<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>`

Inference can also be performed using vLLM.

The Phi-4 family has been integrated in the `4.49.0` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`.

Phi-4-mini-instruct is also available in Azure AI Studio.

After obtaining the Phi-4-mini-instruct model checkpoints, users can use this sample code for inference.

Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

+ Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English languages.
English language varieties with less representation in the training data might experience worse performance than standard American English. + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case. + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses. 
+ Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). 
+ Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. + Architecture: Phi-4-mini-instruct has 3.8B parameters and is a dense decoder-only Transformer model. When compared with Phi-3.5-mini, the major changes with Phi-4-mini-instruct are 200K vocabulary, grouped-query attention, and shared input and output embedding. + Inputs: Text. It is best suited for prompts using the chat format. + Context length: 128K tokens + GPUs: 512 A100-80G + Training time: 21 days + Training data: 5T tokens + Outputs: Generated text in response to the input + Dates: Trained between November and December 2024 + Status: This is a static model trained on offline datasets with the cutoff date of June 2024 for publicly available data. + Supported languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian + Release date: February 2025 Phi-4-mini’s training data includes a wide variety of sources, totaling 5 trillion tokens, and is a combination of 1) publicly available documents filtered for quality, selected high-quality educational data, and code 2) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.) 3) high quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. 
Focus was placed on the quality of data that could potentially improve the reasoning ability for the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a game in premier league on a particular day might be good training data for frontier models, but such information was removed to leave more model capacity for reasoning for the model’s small size. More details about data can be found in the Phi-4-mini-instruct technical report. The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis. A basic example of multi-GPUs supervised fine-tuning (SFT) with TRL and Accelerate modules is provided here. Various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets were leveraged to evaluate Phi-4 models’ propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. 
For this release, the red team tested the model in English, Chinese, Japanese, Spanish, Portuguese, Arabic, Thai, and Russian for the following potential harms: Hate Speech and Bias, Violent Crimes, Specialized Advice, and Election Information. Their findings indicate that the model is resistant to jailbreak techniques across languages, but that language-specific attack prompts leveraging cultural context can cause the model to output harmful content. Another insight was that in function calling scenarios, the model could sometimes hallucinate function names or URLs. The model may also be more susceptible to longer multi-turn jailbreak techniques in both English and non-English languages. These findings highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages, and risk areas that account for cultural nuances where those languages are spoken.

Hardware

Note that by default, the Phi-4-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+ NVIDIA A100
+ NVIDIA A6000
+ NVIDIA H100

If you want to run the model on NVIDIA V100 or earlier generation GPUs, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`.

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.

We include a brief word on methodology here, and in particular, how we think about optimizing prompts.
In an ideal world, we would never change any prompts in our benchmarks, so that comparisons between different models are always apples-to-apples. Indeed, this is our default approach, and is the case for the vast majority of models we have run to date. There are, however, some exceptions. In some cases, a model performs worse than expected on a given eval because it fails to respect the output format. For example:
+ A model may refuse to answer questions (for no apparent reason), or in coding tasks models may prefix their response with “Sure, I can help with that. …” which may break the parser. In such cases, we have opted to try different system messages (e.g. “You must always respond to a question” or “Get to the point!”).
+ With some models, we observed that few shots actually hurt model performance. In this case we did allow running the benchmarks with 0-shots for all cases.
+ We have tools to convert between chat and completions APIs. When converting a chat prompt to a completion prompt, some models have different keywords, e.g. Human vs User. In these cases, we do allow for model-specific mappings for chat to completion prompts.

However, we do not:
+ Pick different few-shot examples. Few shots will always be the same when comparing different models.
+ Change prompt format: e.g. if it is an A/B/C/D multiple choice, we do not tweak this to 1/2/3/4 multiple choice.

The model was evaluated across a breadth of public and internal benchmarks to understand its capabilities under multiple tasks and conditions. While most evaluations use English, the leading multilingual benchmark, which covers performance in select languages, was also incorporated.
More specifically,
+ Reasoning:
  + Winogrande: commonsense reasoning around pronoun resolution
  + PIQA: physical commonsense reasoning around everyday situations
  + ARC-challenge: grade-school multiple choice science questions
  + GPQA: very hard questions written and validated by experts in biology, physics, and chemistry
  + MedQA: medical question answering
  + Social IQA: social commonsense intelligence
  + BoolQ: natural questions from context
  + TruthfulQA: grounded reasoning
+ Language understanding:
  + HellaSwag: commonsense natural language inference around everyday events
  + ANLI: adversarial natural language inference
+ Function calling:
  + Berkeley function calling: function and tool calls
  + Internal function calling benchmarks
+ World knowledge:
  + TriviaQA: trivia questions on general topics
+ Math:
  + GSM8K: grade-school math word problems
  + GSM8K Hard: grade-school math word problems with large values and some absurdity
  + MATH: challenging competition math problems
+ Code:
  + HumanEval, HumanEval+, MBPP, MBPP+: Python coding tasks
  + LiveCodeBench, LiveBench: contamination-free code tasks
  + BigCode Bench: challenging programming tasks
  + Spider: SQL query tasks
  + Internal coding benchmarks
+ Instruction following:
  + IFEval: verifiable instructions
  + Internal instruction following benchmarks
+ Multilingual:
  + MGSM: multilingual grade-school math
  + Multilingual MMLU and MMLU-pro
  + MEGA: multilingual NLP tasks
+ Popular aggregated datasets: MMLU, MMLU-pro, BigBench-Hard, AGI Eval
+ Multi-turn conversations:
  + Data generated by in-house adversarial conversation simulation tool
+ Single-turn trustworthiness evaluation:
  + DecodingTrust: a collection of trustworthiness benchmarks in eight different perspectives
  + XSTest: exaggerated safety evaluation
  + Toxigen: adversarial and hate speech detection
+ Red Team:
  + Responses to prompts provided by AI Red Team at Microsoft
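Returning to the hardware note earlier in this card: the choice between flash attention and the `eager` fallback could be automated along the following lines. This is only a sketch. The helper names are hypothetical, and the compute-capability cutoff of 8.0 is an assumption based on the tested GPUs (A100, A6000, H100 are 8.0 or newer, while V100 is 7.0); using `flash_attention_2` also assumes the separate `flash-attn` package is installed.

```python
def pick_attn_implementation(compute_capability):
    """Return "flash_attention_2" for GPUs with capability 8.0+, else "eager"."""
    return "flash_attention_2" if tuple(compute_capability) >= (8, 0) else "eager"

def load_phi4_mini(model_id="microsoft/Phi-4-mini-instruct"):
    """Load the model with an attention backend suited to the local GPU."""
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
    else:
        capability = (0, 0)  # no GPU: fall back to eager attention
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        attn_implementation=pick_attn_implementation(capability),
    )
```

Keeping the decision in a small pure function makes it easy to override for unusual hardware without touching the loading code.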

license:mit

13,420

Llama-3.2-3B-bnb-4bit

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats. Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing unsloth/Llama-3.2-3B-bnb-4bit For more details on the model, please go to Meta's original model card All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.1 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less | | DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less | - This conversational notebook is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the Meta and Llama team for creating and releasing these models. The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. 
They outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

NaNK

llama

13,198

csm-1b

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
- Fine-tune TTS models for free using our Google Colab notebooks here!
- Read our Blog about TTS support: unsloth.ai/blog/tts

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Sesame-CSM-1B | ▶️ Start on Colab | 1.5x faster | 58% less |
| Whisper Large V3 | ▶️ Start on Colab | 1.5x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 70% less |
| Llama 3.2 Vision (11B) | ▶️ Start on Colab | 1.8x faster | 50% less |

2025/03/13 - We are releasing the 1B CSM variant. Code is available on GitHub: SesameAILabs/csm.

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes. A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post. A hosted HuggingFace space is also available for testing audio generation.

CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.

The model open sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice. CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation. The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

This project provides a high-quality speech generation model for research and educational purposes.
While we encourage responsible and ethical use, we explicitly prohibit the following: - Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent. - Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls. - Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes. By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology. Authors Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.

NaNK

license:apache-2.0

12,793

Nanonets-OCR-s-GGUF

—

12,750

Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit

NaNK

license:apache-2.0

12,328

DeepSeek-R1-Distill-Qwen-32B-GGUF

See our collection for versions of DeepSeek-R1 including GGUF and original formats.

Instructions to run this model in llama.cpp (more detailed instructions are available at unsloth.ai/blog/deepseek-r1):

1. Do not forget about the `<｜User｜>` and `<｜Assistant｜>` tokens! Alternatively, use a chat template formatter.
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp
3. Run an example with a Q8_0 K quantized cache. Note that `-no-cnv` disables auto conversation mode.
4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-Alpaca.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
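As a rough illustration of the llama.cpp steps listed above, the sketch below assembles a `llama-cli` command line from Python. Everything specific in it is an assumption made for the example: the binary path, the model filename, the thread count, and the number of offloaded layers are hypothetical, and the user/assistant markers are taken from the DeepSeek-R1 chat template rather than from this card.

```python
import shlex

# Assumed DeepSeek-R1 chat markers (they use full-width vertical bars).
USER, ASSISTANT = "<｜User｜>", "<｜Assistant｜>"

def build_prompt(question: str) -> str:
    """Wrap a single user turn in the chat markers (step 1)."""
    return f"{USER}{question}{ASSISTANT}"

def llama_cli_args(model_path, question, gpu_layers=0, threads=16):
    """Build an argument list for llama-cli (steps 3 and 4)."""
    args = [
        "./llama.cpp/llama-cli",     # hypothetical path to the built binary
        "--model", model_path,
        "--cache-type-k", "q8_0",    # Q8_0 quantized K cache
        "--threads", str(threads),
        "--prompt", build_prompt(question),
        "-no-cnv",                   # disable auto conversation mode
    ]
    if gpu_layers > 0:
        # Offload some layers to the GPU when one is available.
        args += ["--n-gpu-layers", str(gpu_layers)]
    return args

# Hypothetical model filename, shown only to illustrate the command shape.
print(shlex.join(llama_cli_args("model-Q4_K_M.gguf", "What is 1+1?", gpu_layers=20)))
```

The list form can be passed directly to `subprocess.run`, which avoids shell-quoting problems with the special marker characters.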
| Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2 VL (7B) | ▶️ Start on Colab-Vision.ipynb) | 1.8x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | | Llama-3.1 (8B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab-Conversational.ipynb) | 2.2x faster | 62% less | - This Llama 3.2 conversational notebook-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook-TextCompletion.ipynb) is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the DeepSeek team for creating and releasing these models. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. 
To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section. Post-Training: Large-Scale Reinforcement Learning on the Base Model - We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. - Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. 
The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace |

DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository.

| Model | Base Model | Download |
| :------------: | :------------: | :------------: |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace |

DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models.

DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
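One plausible reading of the pass@1 estimate just described is the fraction of correct responses among the sampled ones, averaged over queries. The sketch below assumes that reading; `pass_at_1` and the `SAMPLING` dictionary keys are hypothetical names, and only the numeric values come from the text.

```python
# Sampling settings quoted in the evaluation description above.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "n": 64, "max_tokens": 32768}

def pass_at_1(correct_per_query):
    """Mean, over queries, of the fraction of sampled responses that are correct.

    `correct_per_query` holds one list of booleans per query, with one
    entry per sampled response (64 entries each in the setup above).
    """
    per_query = [sum(flags) / len(flags) for flags in correct_per_query]
    return sum(per_query) / len(per_query)
```

Averaging many samples per query reduces the variance that a single sampled response at temperature 0.6 would introduce.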
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |

5. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, by switching on the "DeepThink" button. We also provide an OpenAI-compatible API at DeepSeek Platform: platform.deepseek.com

Please visit DeepSeek-V3 repo for more information about running DeepSeek-R1 locally. DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models. For instance, you can easily start a service using vLLM.

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

7. License

This code repository and the model weights are licensed under the MIT License. The DeepSeek-R1 series supports commercial use and allows any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:
- DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from the Qwen-2.5 series, which is originally licensed under the Apache 2.0 License, and are now finetuned with 800k samples curated with DeepSeek-R1.
- DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under the llama3.1 license.
- DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under the llama3.3 license.

9. Contact

If you have any questions, please raise an issue or contact us at [email protected].
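The usage recommendations in this card (temperature in the 0.5-0.7 range, no system prompt, a `\boxed{}` directive for math, and averaging over multiple runs) could be applied roughly as shown below. This is a minimal sketch: the request dictionary follows the common OpenAI-style chat layout, and `build_request`, `extract_boxed`, and `average_score` are hypothetical helper names, not part of any DeepSeek library.

```python
import re

MATH_DIRECTIVE = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)

def build_request(question, is_math=False, temperature=0.6):
    """Build a chat request with a single user turn and no system prompt."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("recommended temperature range is 0.5-0.7")
    content = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
    }

def extract_boxed(text):
    # Return the contents of the last brace-free \boxed{...}, or None.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def average_score(scores):
    """Average the results of several evaluation runs."""
    return sum(scores) / len(scores)
```

The simple regular expression does not handle nested braces such as `\boxed{\frac{1}{2}}`; a real evaluation harness would need a brace-matching parser for those answers.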

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth! All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

llama

11,612

Llama-3.2-3B

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing Llama-3.2-3B For more details on the model, please go to Meta's original model card All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.1 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less | | DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less | - This conversational notebook is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the Meta and Llama team for creating and releasing these models. The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. 
Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. - Fine-tune Qwen3 (14B) for free using our Google Colab notebook here! - Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3 - View the rest of our notebooks in our docs here. - Run & export your fine-tuned model to Ollama, llama.cpp or HF. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less | | GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less | | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | We introduce the updated version of the Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring the following key enhancements: - Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage. - Substantial gains in long-tail knowledge coverage across multiple languages. - Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation. - Enhanced capabilities in 256K long-context understanding. 
Qwen3-235B-A22B-Instruct-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | ---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2 | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |

For reproducibility, we report the win rates evaluated by GPT-4.1.

\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint.

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to cite us.
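As a rough illustration of the sampling settings recommended above, the sketch below assembles a request body for an OpenAI-compatible chat completions endpoint. The helper name `build_request`, the served model name, and the assumption that the server accepts `top_k` and `min_p` as extra fields are illustrative and not taken from the model card; check your serving framework's documentation for the exact parameter names.

```python
import json

# Sampling settings recommended in the model card.
RECOMMENDED_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,   # assumption: the server accepts this as an extra field
    "min_p": 0.0,  # assumption: the server accepts this as an extra field
}

# Prompt suffix the card suggests for math benchmarks.
MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."


def build_request(prompt, model="Qwen3-235B-A22B-Instruct-2507",
                  max_tokens=16384, presence_penalty=0.0):
    """Build a chat completions payload using the recommended settings.

    `model` is a hypothetical served model name; replace it with your own.
    """
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "presence_penalty": presence_penalty,
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


request = build_request("What is 12 * 7? " + MATH_SUFFIX, presence_penalty=0.5)
print(json.dumps(request, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to the endpoint. Keeping the settings in one constant makes it easy to raise `presence_penalty` only for prompts where repetition actually shows up.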

Qwen2.5-14B-Instruct-bnb-4bit

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Qwen 2.5 (all model sizes), and also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 14B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 14.7B
- Number of Parameters (Non-Embedding): 13.1B
- Number of Layers: 48
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

To generate content, load the tokenizer and the model, then build the prompt with `apply_chat_template`.

The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog.

For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
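The YaRN setup described above amounts to adding a rope-scaling entry to `config.json`. The sketch below shows one way to do that on an in-memory config. The helper name `enable_yarn` is hypothetical, and the key names and values (a `factor` of 4.0 over the 32,768-token base, which gives 131,072) are assumptions chosen to match the context lengths quoted in this card, so verify them against the official Qwen2.5 documentation before use.

```python
import json


def enable_yarn(config, factor=4.0, original_max_position_embeddings=32768):
    """Return a copy of a model config dict with a YaRN rope-scaling entry.

    The key names follow the layout commonly shown for Qwen2.5 (an assumption
    here); the input dict is left unmodified.
    """
    patched = dict(config)
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max_position_embeddings,
    }
    return patched


# Minimal stand-in for the contents of config.json.
base_config = {"max_position_embeddings": 32768}
long_config = enable_yarn(base_config)

# Effective context length implied by the scaling factor.
scaling = long_config["rope_scaling"]
extended_context = int(scaling["factor"] * scaling["original_max_position_embeddings"])

print(json.dumps(long_config, indent=2))
print(extended_context)
```

Because static YaRN applies the same factor to every input, it is reasonable to keep two config variants and load the patched one only for long-context workloads.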


license:apache-2.0

10,783

Qwen2.5-Coder-7B-bnb-4bit

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Qwen 2.5 (all model sizes), and also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). For Qwen2.5-Coder, we release base and instruction-tuned language models in three sizes: 1.5, 7, and 32 (coming soon) billion parameters. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the 7B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

We do not recommend using base language models for conversations. Instead, you can apply post-training (e.g., SFT, RLHF, continued pretraining) or use this model for fill-in-the-middle tasks.

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog.

For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
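Since this is a base model, fill-in-the-middle is one of the tasks it can be used for without a chat template. The sketch below builds such a prompt as a plain string. The special tokens `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>` are an assumption based on the Qwen2.5-Coder documentation and do not appear in this card, and `build_fim_prompt` is a hypothetical helper.

```python
def build_fim_prompt(prefix, suffix):
    """Assemble a fill-in-the-middle prompt.

    The model is expected to generate the code that belongs between
    `prefix` and `suffix`. The token names are an assumption; confirm
    them against the tokenizer you actually load.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


# Code before and after the gap we want the model to fill.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"

prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The prompt is passed to the model as raw text, and whatever the model generates after the final token is the proposed middle section.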


license:apache-2.0

10,730

DeepSeek-R1-Distill-Llama-70B-GGUF

Qwen3-235B-A22B-Thinking-2507-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

Learn to run Qwen3-2507 correctly: read our Guide.

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.
Qwen3-235B-A22B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
|--- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

\* For OpenAI O4-mini and O3, we use a medium reasoning effort, except for scores marked with \*, which are generated using high reasoning effort.

\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With older versions of `transformers`, you will encounter an error.

After generation, the thinking content is separated from the final answer by locating the end-of-thinking token in the generated ids:

```python
# `tokenizer` and `output_ids` (the generated token ids as a list) come from
# the preceding generation step.
try:
    # Find the position just after the last end-of-thinking token (id 151668).
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No end-of-thinking token found: treat the whole output as content.
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

To create an OpenAI-compatible API endpoint:

- SGLang:
    ```shell
    python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser qwen3
    ```
- vLLM:
    ```shell
    vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
    ```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
#
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#
#     # Use a custom endpoint compatible with OpenAI API:
#     'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
#     'api_key': 'EMPTY',
#     'generate_cfg': {
#         'thought_in_content': True,
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

If you find our work helpful, feel free to cite us.

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```


license:apache-2.0

10,319

Qwen2.5-3B-Instruct

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Qwen 2.5 (all model sizes), and also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 3B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

To generate content, load the tokenizer and the model, then build the prompt with `apply_chat_template`.

Detailed evaluation results are reported in this 📑 blog.

For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
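To make the role of `apply_chat_template` more concrete, the sketch below formats a conversation by hand in the ChatML layout that Qwen2.5 chat models use. It is a simplified stand-in and not the tokenizer's real template (which, for instance, may insert a default system prompt), and `format_chatml` is a hypothetical helper. In practice, call the tokenizer's `apply_chat_template` instead.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a simplified ChatML layout.

    Each message becomes `<|im_start|>{role}\n{content}<|im_end|>\n`. With
    `add_generation_prompt=True`, an open assistant turn is appended so the
    model continues from there.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

prompt = format_chatml(messages)
print(prompt)
```

The string produced here is what gets tokenized and passed to `generate()`. The trailing open assistant turn is what tells the model that it is its turn to respond.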


—

9,967

Llama-3.1-8B-Instruct-GGUF

See our collection for versions of Llama 3.1 including 4-bit & 16-bit formats.

Unsloth Dynamic v2.0 achieves superior accuracy & outperforms other leading quant methods.

- Read our Blog about Llama 3.1 fine-tuning support: unsloth.ai/blog/llama4
- View the rest of our fine-tuning notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp, vLLM or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Llama 3.1 (8B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.1 family of models. Token counts refer to pretraining data only.
All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here. Intended Use Cases Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases. Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card. Note : Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.1 in additional languages is done in a safe and responsible manner. This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase. 
Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.

Make sure to update your transformers installation via `pip install --upgrade transformers`.

Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at `huggingface-llama-recipes`.

LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting here.

Tool use is also supported through chat templates in Transformers. You define a tool, pass it to the chat template together with the messages, and then generate text from this input as normal. If the model generates a tool call, you should add it to the chat as an assistant message, then call the tool and append the result with the `tool` role. After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling; for more information, see the LLaMA prompt format docs and the Transformers tool use documentation.

Original checkpoints can be downloaded with `huggingface-cli`.

Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.

Training utilized a cumulative 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training.
Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq. The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others. Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023. In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. Responsibility & Safety As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks: Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm. Provide protections for the community to help prevent the misuse of our models. Responsible deployment Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. 
Llama 3.1 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Llama 3.1 instruct Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. Llama 3.1 systems Large language models, including Llama 3.1, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. 
All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. New capabilities Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. Multilinguality: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide. Evaluations We evaluated Llama models for common use cases as well as specific capabilities. Common use cases evaluations measure safety risks of systems for most commonly built applications including chat bot, coding assistant, tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case. 
Prompt Guard and Code Shield are also available if relevant to the application.

Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which dedicated benchmarks were crafted, including long context, multilingual, tool calls, coding or memorization.

Red teaming: For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets.

Critical and other risks: We specifically focused our efforts on mitigating the following critical risk areas:

1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness: To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.

2. Child Safety: Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning.
We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences. 3. Cyber attack enablement Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Our study of Llama-3.1-405B’s social engineering uplift for cyber attackers was conducted to assess the effectiveness of AI models in aiding cyber threat actors in spear phishing campaigns. Please read our Llama 3.1 Cyber security whitepaper to learn more. Community Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. 
Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners, including cloud service providers. We encourage community contributions to our GitHub repository. We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here. Finally, we put in place a set of resources, including an output reporting mechanism and bug bounty program, to continuously improve the Llama technology with the help of the community.

Ethical Considerations and Limitations

The core values of Llama 3.1 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.1 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3.1 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.1’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.1 models, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.

llama

9,922

Qwen2.5-Omni-7B-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

Learn to run Qwen3-Coder correctly - Read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct, featuring the following key enhancements:

- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function call format.
Qwen3-Coder-480B-A35B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 480B in total and 35B activated
- Number of Layers: 62
- Number of Attention Heads (GQA): 96 for Q and 8 for KV
- Number of Experts: 160
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

We advise you to use the latest version of `transformers`.

```python
# Define Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

from openai import OpenAI

# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-480B-A35B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
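The tool-calling request above only asks the model which tool to call; the application still has to run the tool and send the result back. The following is a minimal sketch of that dispatch step, not the authors' implementation. It assumes the tool is spelled `square_the_number` with an `input_num` argument, and that the server returns arguments as a JSON string in the usual OpenAI-compatible way. `TOOL_REGISTRY`, `dispatch_tool_call`, and the example `tool_call_id` are hypothetical names introduced here for illustration.

```python
import json


def square_the_number(input_num: float) -> float:
    """Hypothetical local implementation of the squaring tool."""
    return input_num ** 2


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"square_the_number": square_the_number}


def dispatch_tool_call(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Run one tool call and build the follow-up `tool` message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**kwargs)
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }


if __name__ == "__main__":
    # "call_0" is a made-up id; a real one comes from the model response.
    print(dispatch_tool_call("call_0", "square_the_number", '{"input_num": 1024}'))
```

The returned dictionary would be appended to `messages` before a second completion request, so the model can phrase its final answer from the tool output.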

medgemma-4b-it-unsloth-bnb-4bit

—

8,912

Qwen2.5-VL-3B-Instruct-GGUF

In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL directly plays as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it now has the ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting uses in finance, commerce, and similar fields.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU val | 52.3 | 54.1 | 53.1 |
| MMMU-Pro val | 32.7 | 30.5 | 31.6 |
| AI2D test | 81.4 | 83.0 | 81.5 |
| DocVQA test | 91.6 | 94.5 | 93.9 |
| InfoVQA test | 72.1 | 76.5 | 77.1 |
| TextVQA val | 76.8 | 84.3 | 79.3 |
| MMBench-V1.1 test | 79.3 | 80.7 | 77.6 |
| MMStar | 58.3 | 60.7 | 55.9 |
| MathVista testmini | 60.5 | 58.2 | 62.3 |
| MathVision full | 20.9 | 16.3 | 21.2 |

Video benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-3B |
|-------------------------|---------------|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |

Requirements

The code of Qwen2.5-VL has been in the latest Hugging Face transformers and we advise you to build from source with command:

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos.
You can install it using the following command:

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which falls back to torchvision for video processing. However, you can still install decord from source so that it is used when loading video.

Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

Video URL compatibility largely depends on the third-party library version; the details are in the table below. Set `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default backend.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for context length up to 32,768 tokens.
To handle inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you could add the following to `config.json` to enable YaRN:

However, this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with IDs, `max_position_embeddings` can instead be set directly to a larger value, such as 64k.

If you find our work helpful, feel free to cite us.
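As an illustration of the `config.json` change discussed above, here is a minimal sketch that merges YaRN settings into an existing configuration file. The key names (`rope_scaling`, `type`, `factor`, `original_max_position_embeddings`) follow the convention commonly used in Hugging Face Qwen configs and are an assumption here, as are the helper name `enable_yarn` and the factor of 4.0. Check the exact fragment your framework expects, and keep in mind the caveat about localization tasks before enabling it.

```python
import json
from pathlib import Path


def enable_yarn(config_path, factor=4.0, original_max_position_embeddings=32768):
    """Merge YaRN settings into the `rope_scaling` entry of a config.json file.

    Existing keys inside `rope_scaling` are preserved; only the YaRN-related
    keys are added or overwritten. All key names are assumptions (see lead-in).
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    rope_scaling = dict(config.get("rope_scaling") or {})
    rope_scaling.update(
        {
            "type": "yarn",
            "factor": factor,  # illustrative value, not taken from the source
            "original_max_position_embeddings": original_max_position_embeddings,
        }
    )
    config["rope_scaling"] = rope_scaling
    path.write_text(json.dumps(config, indent=2))
    return config
```

Merging rather than replacing the `rope_scaling` dictionary is a deliberate choice: multimodal configs may already carry model-specific entries there that should not be lost.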

DeepSeek-V3.1-GGUF

Learn how to run DeepSeek-V3.1 correctly - Read our Guide. See how DeepSeek-V3.1 Dynamic 3-bit GGUF scores 75.6% on Aider Polyglot here.

These quants include our Unsloth chat template fixes, specifically for llama.cpp supported backends.

- You must use --jinja for llama.cpp quants
- Set the temperature ~0.6 (recommended) and Top_P value of 0.95 (recommended)
- UD-Q2_K_XL (247GB) is recommended
- For complete detailed instructions, see our guide: unsloth.ai/blog/deepseek-v3.1

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:

- Hybrid thinking mode: One model supports both thinking mode and non-thinking mode by changing the chat template.
- Smarter tool calling: Through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
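The recommended run settings listed above (the `--jinja` flag, temperature around 0.6, top-p 0.95) can be assembled into a llama.cpp command line as in the sketch below. The helper name `build_llama_cli_args`, the `llama-cli` binary name, and the model path in the usage example are assumptions for illustration. Only the flags themselves (`-m`, `--jinja`, `--temp`, `--top-p`) are standard llama.cpp options.

```python
import shlex


def build_llama_cli_args(model_path, temperature=0.6, top_p=0.95, binary="llama-cli"):
    """Build an argument list for llama.cpp with the recommended sampling settings."""
    return [
        binary,
        "-m", str(model_path),        # path to the GGUF file (hypothetical in the demo)
        "--jinja",                    # use the chat template embedded in the GGUF
        "--temp", str(temperature),   # recommended: about 0.6
        "--top-p", str(top_p),        # recommended: 0.95
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a real file name from the repo.
    print(shlex.join(build_llama_cli_args("model.gguf")))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.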
| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description.

With the given prefix, DeepSeek V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek V3, it introduces an additional token `</think>`.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

By concatenating the context and the prefix, we obtain the correct prompt for the query.

The prefix of thinking mode is similar to DeepSeek-R1.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking token in the last turn is dropped, but the `</think>` is retained in every turn of context.

ToolCall

Tool calling is supported in non-thinking mode. The format is: ` {system prompt}{tool_description} {query} ` where the tool_description is

Code-Agent

We support various code agent frameworks. Please refer to the above tool-call format to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent

We designed a specific tool-call format for search in thinking mode to support search agents. For complex questions that require accessing external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed template.
Evaluation

| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
|----------|--------------------|---|---|---|---|
| General | | | | | |
| | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | | | | | |
| | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp_zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | | | | | |
| | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | | | | | |
| | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | | | | | |
| | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note:

- Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search agent results of R1-0528 are evaluated with a pre-defined workflow.
- SWE-bench is evaluated with our internal code agent framework.

The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally.

This repository and the model weights are licensed under the MIT License.

If you have any questions, please raise an issue or contact us at [email protected].

license:mit

8,565

Qwen2.5-1.5B-Instruct

license:apache-2.0

8,500

whisper-large-v3

license:apache-2.0

8,380

Magistral-Small-2509

license:apache-2.0

7,412

Phi-4-mini-reasoning-GGUF

See our collection for all versions of Phi-4 including GGUF, 4-bit & 16-bit formats.

Learn to run Phi-4 reasoning correctly - Read our Guide. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Phi-4 (14B) for free using our Google Colab notebook here!
- Read our Blog about Phi-4 support with our bug fixes: unsloth.ai/blog/phi4
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Phi-4-mini-reasoning is a lightweight open model built upon synthetic data with a focus on high-quality, reasoning-dense data, further finetuned for more advanced math reasoning capabilities. The model belongs to the Phi-4 model family and supports 128K token context length.

📰 Phi-4-mini-reasoning Blog, and Developer Article 📖 Phi-4-mini-reasoning Technical Report 👩‍🍳 Phi Cookbook 🏡 Phi Portal 🖥️ Try It Azure

🎉 Phi-4 models: [Phi-4-reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx]

Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks under memory/compute constrained environments and latency bound scenarios. Some of the use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios.
These models excel at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking. This model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models, as well as performance difference across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. This release of Phi-4-mini-reasoning addresses user feedback and market demand for a compact reasoning model. It is a compact transformer-based language model optimized for mathematical reasoning, built to deliver high-quality, step-by-step problem solving in environments where computing or latency is constrained. The model is fine-tuned with synthetic math data from a more capable model (much larger, smarter, more accurate, and better at following instructions), which has resulted in enhanced reasoning performance. Phi-4-mini-reasoning balances reasoning ability with efficiency, making it potentially suitable for educational applications, embedded tutoring, and lightweight deployment on edge or mobile systems. If a critical issue is identified with Phi-4-mini-reasoning, it should be promptly reported through the MSRC Researcher Portal or [email protected] To understand the capabilities, the 3.8B parameters Phi-4-mini-reasoning model was compared with a set of models over a variety of reasoning benchmarks. 
A high-level overview of the model quality is as follows:

| Model | AIME | MATH-500 | GPQA Diamond |
|------------------------------------|-------|----------|--------------|
| o1-mini | 63.6 | 90.0 | 60.0 |
| DeepSeek-R1-Distill-Qwen-7B | 53.3 | 91.4 | 49.5 |
| DeepSeek-R1-Distill-Llama-8B | 43.3 | 86.9 | 47.3 |
| Bespoke-Stratos-7B | 20.0 | 82.0 | 37.8 |
| OpenThinker-7B | 31.3 | 83.0 | 42.4 |
| Llama-3.2-3B-Instruct | 6.7 | 44.4 | 25.3 |
| Phi-4-Mini (base model, 3.8B) | 10.0 | 71.8 | 36.9 |
| Phi-4-mini-reasoning (3.8B) | 57.5 | 94.6 | 52.0 |

Overall, with only 3.8B parameters, the model achieves a similar level of multilingual language understanding and reasoning ability as much larger models. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store much factual knowledge, so users may experience factual incorrectness. It may be possible to mitigate this weakness by augmenting Phi-4 with a search engine, particularly when using the model under RAG settings.

Phi-4-mini-reasoning supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

Given the nature of the training data, the Phi-4-mini-instruct model is best suited for prompts using specific formats. Below are the two primary formats:

This format is used for general conversation and instructions:

Phi-4-mini-reasoning has been integrated in the `4.51.3` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`. Python 3.8 and 3.10 will work best.

List of required packages:

Phi-4-mini-reasoning is also available in Azure AI Studio.

After obtaining the Phi-4-mini-instruct model checkpoints, users can use this sample code for inference.
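The `transformers` version requirement mentioned above (4.51.3) can also be checked from Python rather than with `pip list`. The sketch below is one possible way to do that, using only the standard library. The helper names `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical, and the parser is deliberately simple: it compares only the leading numeric components and stops at the first non-numeric part.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.51.3' into (4, 51, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, required="4.51.3"):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_transformers_version()
    if version is None:
        print("transformers is not installed")
    else:
        print(version, "OK" if meets_minimum(version) else "too old")
```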
+ Architecture: Phi-4-mini-reasoning shares the same architecture as Phi-4-Mini, which has 3.8B parameters and is a dense decoder-only Transformer model. When compared with Phi-3.5-Mini, the major changes with Phi-4-Mini are 200K vocabulary, grouped-query attention, and shared input and output embedding.
+ Inputs: Text. It is best suited for prompts using the chat format.
+ Context length: 128K tokens
+ GPUs: 128 H100-80G
+ Training time: 2 days
+ Training data: 150B tokens
+ Outputs: Generated text
+ Dates: Trained in February 2024
+ Status: This is a static model trained on offline datasets with the cutoff date of February 2025 for publicly available data.
+ Supported languages: English
+ Release date: April 2025

The training data for Phi-4-mini-reasoning consists exclusively of synthetic mathematical content generated by a stronger and more advanced reasoning model, Deepseek-R1. The objective is to distill knowledge from this model. This synthetic dataset comprises over one million diverse math problems spanning multiple levels of difficulty (from middle school to Ph.D. level). For each problem in the synthetic dataset, eight distinct solutions (rollouts) were sampled, and only those verified as correct were retained, resulting in approximately 30 billion tokens of math content.
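The rollout filtering described above (sample several solutions per problem, keep only the verified ones) can be sketched as follows. This is an illustration of the general idea, not the authors' pipeline: the source does not say how correctness was verified, so the exact-match verifier, the helper name `filter_correct_rollouts`, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List


def filter_correct_rollouts(
    rollouts_by_problem: Dict[str, List[str]],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Keep only rollouts the verifier accepts; drop problems with none left."""
    kept: Dict[str, List[str]] = {}
    for problem, rollouts in rollouts_by_problem.items():
        correct = [r for r in rollouts if is_correct(problem, r)]
        if correct:
            kept[problem] = correct
    return kept


# Toy reference answers and a toy exact-match verifier (both hypothetical).
REFERENCE_ANSWERS = {"2 + 2": "4", "3 * 3": "9"}


def exact_match(problem: str, rollout: str) -> bool:
    """Accept a rollout when it equals the stored reference answer."""
    return rollout.strip() == REFERENCE_ANSWERS[problem]


if __name__ == "__main__":
    sampled = {"2 + 2": ["4", "5", "4"], "3 * 3": ["6"]}
    print(filter_correct_rollouts(sampled, exact_match))
```

In a real setting each problem would have eight sampled rollouts and the verifier would check a final answer extracted from a long reasoning trace.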
The dataset integrates three primary components:

1) a curated selection of high-quality, publicly available math questions and a part of the SFT (Supervised Fine-Tuning) data that was used to train the base Phi-4-Mini model;
2) an extensive collection of synthetic math data generated by the Deepseek-R1 model, designed specifically for high-quality supervised fine-tuning and model distillation; and
3) a balanced set of correct and incorrect answers used to construct preference data aimed at enhancing Phi-4-mini-reasoning's reasoning capabilities by learning more effective reasoning trajectories.

Hardware

Note that by default, the Phi-4-mini-reasoning model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

+ NVIDIA A100
+ NVIDIA H100

If you want to run the model on NVIDIA V100 or earlier generation GPUs, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`.

The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT, DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches, utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories.

Phi-4-Mini-Reasoning was developed in accordance with Microsoft's responsible AI principles. Potential safety risks in the model’s responses were assessed using the Azure AI Foundry’s Risk and Safety Evaluation framework, focusing on harmful content, direct jailbreak, and model groundedness.
The Phi-4-Mini-Reasoning Model Card contains additional information about our approach to safety and responsible AI considerations that developers should be aware of when using this model. Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: + Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English. English language varieties with less representation in the training data might experience worse performance than standard American English. + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case. 
+ Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Election Information Reliability : The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region. + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses. + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. 
+ High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
+ Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
+ Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
+ Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

We include a brief word on methodology here - and in particular, how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models.
Indeed, this is our default approach, and is the case in the vast majority of models we have run to date. For all benchmarks, we use the same generation configuration for every model, including the same maximum sequence length (32768) and the same temperature, so that the comparison is fair.

Benchmark datasets

We evaluate the model with three of the most popular benchmarks on which the strongest reasoning models compete. Specifically:

- Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving.
- AIME 2024: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning.
- GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark consists of difficult multiple-choice questions in biology, physics, and chemistry written by domain experts, and evaluates the model's ability to reason through graduate-level science problems.
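The "same generation configuration for every model" rule described above can be sketched as a single frozen settings object shared by every run. This is only an illustration, not the evaluation harness used for the reported results: the class and function names are hypothetical, and the temperature value is a placeholder because the text states that one temperature is shared but does not say which.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SharedGenerationConfig:
    """Settings held constant across all models (hypothetical helper)."""
    max_sequence_length: int = 32768  # value stated in the text
    temperature: float = 0.6          # placeholder: the actual value is not given


# Benchmark names as listed in the text.
BENCHMARKS = ("Math-500", "AIME 2024", "GPQA Diamond")


def build_run_plan(model_names, config=SharedGenerationConfig()):
    """Pair every model with every benchmark under one shared configuration."""
    return [
        {"model": model, "benchmark": benchmark, **asdict(config)}
        for model in model_names
        for benchmark in BENCHMARKS
    ]


# Two hypothetical model names, used only to show the shape of the plan.
plan = build_run_plan(["model-a", "model-b"])
for run in plan:
    print(run)
```

Because the configuration is frozen and created once, no individual run can drift to a different sequence length or temperature without the change being visible in one place.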

license:mit

7,402

orpheus-3b-0.1-ft-unsloth-bnb-4bit

llama

7,386

Qwen3-VL-2B-Instruct-unsloth-bnb-4bit

license:apache-2.0

7,358

medgemma-4b-it-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Model on Google Cloud Model Garden: MedGemma Model on Hugging Face: MedGemma GitHub repository (supporting code, Colab notebooks, discussions, and issues): MedGemma Quick start notebook: GitHub Fine-tuning notebook: GitHub Concept applications built using MedGemma: Collection Support: See Contact License: The use of MedGemma is governed by the Health AI Developer Foundations terms of use. This section describes the MedGemma model and how to use it. MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications. MedGemma currently comes in three variants: a 4B multimodal version and 27B text-only and multimodal versions. Both MedGemma multimodal versions utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Their LLM components are trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma 4B is available in both pre-trained (suffix: `-pt`) and instruction-tuned (suffix `-it`) versions. The instruction-tuned version is a better starting point for most applications. The pre-trained version is available for those who want to experiment more deeply with the models. MedGemma 27B multimodal has pre-training on medical image, medical record and medical record comprehension tasks. MedGemma 27B text-only has been trained exclusively on medical text. Both models have been optimized for inference-time computation on medical reasoning. 
This means it has slightly higher performance on some text benchmarks than MedGemma 27B multimodal. Users who want to work with a single model for both medical text, medical record and medical image tasks are better suited for MedGemma 27B multimodal. Those that only need text use-cases may be better served with the text-only variant. Both MedGemma 27B variants are only available in instruction-tuned versions. MedGemma variants have been evaluated on a range of clinically relevant benchmarks to illustrate their baseline performance. These evaluations are based on both open benchmark datasets and curated datasets. Developers can fine-tune MedGemma variants for improved performance. Consult the Intended Use section below for more details. MedGemma is optimized for medical applications that involve a text generation component. For medical image-based applications that do not involve text generation, such as data-efficient classification, zero-shot classification, or content-based or semantic image retrieval, the MedSigLIP image encoder is recommended. MedSigLIP is based on the same image encoder that powers MedGemma. Please consult the MedGemma Technical Report for more details. Below are some example code snippets to help you quickly get started running the model locally on GPU. If you want to use the model at scale, we recommend that you create a production version using Model Garden. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. See the following Colab notebooks for examples of how to use MedGemma: To give the model a quick try, running it locally with weights from Hugging Face, see Quick start notebook in Colab. Note that you will need to use Colab Enterprise to obtain adequate GPU resources to run either 27B model without quantization. For an example of fine-tuning the 4B model, see the Fine-tuning notebook in Colab. 
The 27B models can be fine tuned in a similar manner but will require more time and compute resources than the 4B model.

The MedGemma model is built based on Gemma 3 and uses the same decoder-only transformer architecture as Gemma 3. To read more about the architecture, consult the Gemma 3 model card.

- Model type: Decoder-only Transformer architecture, see the Gemma 3 Technical Report
- Input Modalities: Text, vision
- Output Modality: Text only
- Attention mechanism: Grouped-query attention (GQA)
- Context length: Supports long context, at least 128K tokens
- Key publication: https://arxiv.org/abs/2507.05201
- Model created: July 9, 2025

When using this model, please cite: Sellergren et al. "MedGemma Technical Report." arXiv preprint arXiv:2507.05201 (2025).

Input:

- Text string, such as a question or prompt
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
- Total input length of 128K tokens

Output:

- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of 8192 tokens

MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks.

The multimodal performance of MedGemma 4B and 27B multimodal was evaluated across a range of benchmarks, focusing on radiology, dermatology, histopathology, ophthalmology, and multimodal clinical reasoning. MedGemma 4B outperforms the base Gemma 3 4B model across all tested multimodal health benchmarks.

| Task and metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| Medical image classification | | |
| MIMIC CXR\*\* \- macro F1 for top 5 conditions | 81.2 | 88.9 |
| CheXpert CXR \- macro F1 for top 5 conditions | 32.6 | 48.1 |
| CXR14 \- macro F1 for 3 conditions | 32.0 | 50.1 |
| PathMCQA\* (histopathology, internal\*\*) \- Accuracy | 37.1 | 69.8 |
| US-DermMCQA\* \- Accuracy | 52.5 | 71.8 |
| EyePACS\* (fundus, internal) \- Accuracy | 14.4 | 64.9 |
| Visual question answering | | |
| SLAKE (radiology) \- Tokenized F1 | 40.2 | 72.3 |
| VQA-RAD\*\*\* (radiology) \- Tokenized F1 | 33.6 | 49.9 |
| Knowledge and reasoning | | |
| MedXpertQA (text \+ multimodal questions) \- Accuracy | 16.4 | 18.8 |

\* Internal datasets. US-DermMCQA is described in Liu (2020, Nature medicine), presented as a 4-way MCQ per example for skin condition classification. PathMCQA is based on multiple datasets, presented as 3-9 way MCQ per example for identification, grading, and subtype for breast, cervical, and prostate cancer. EyePACS is a dataset of fundus images with classification labels based on 5-level diabetic retinopathy severity (None, Mild, Moderate, Severe, Proliferative). More details in the MedGemma Technical Report.

\*\* Based on radiologist adjudicated labels, described in Yang (2024, arXiv) Section A.1.1.

\*\*\* Based on "balanced split," described in Yang (2024, arXiv).

MedGemma chest X-ray (CXR) report generation performance was evaluated on MIMIC-CXR using the RadGraph F1 metric. We compare the MedGemma pre-trained checkpoint with our previous best model for CXR report generation, PaliGemma 2.

| Metric | MedGemma 4B (pre-trained) | MedGemma 4B (tuned for CXR) | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
| :---- | :---- | :---- | :---- | :---- |
| MIMIC CXR \- RadGraph F1 | 29.5 | 30.3 | 28.8 | 29.5 |

The instruction-tuned versions of MedGemma 4B and MedGemma 27B achieve lower scores (21.9 and 21.3, respectively) due to the differences in reporting style compared to the MIMIC ground truth reports. Further fine-tuning on MIMIC reports enables users to achieve improved performance, as shown by the improved performance of the MedGemma 4B model that was tuned for CXR.

MedGemma 4B and text-only MedGemma 27B were evaluated across a range of text-only benchmarks for medical knowledge and reasoning. The MedGemma models outperform their respective base Gemma models across all tested text-only health benchmarks.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| MedQA (4-op) | 50.7 | 64.4 |
| MedMCQA | 45.4 | 55.7 |
| PubMedQA | 68.4 | 73.4 |
| MMLU Med | 67.2 | 70.0 |
| MedXpertQA (text only) | 11.6 | 14.2 |
| AfriMed-QA (25 question test set) | 48.0 | 52.0 |

For all MedGemma 27B results, test-time scaling is used to improve performance.

All models were evaluated on a question answer dataset from synthetic FHIR data to answer questions about patient records. MedGemma 27B multimodal's FHIR-specific training gives it significant improvement over other MedGemma and Gemma models.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| EHRQA | 70.9 | 67.6 |

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics.
These models were evaluated against a number of different categories relevant to ethics and safety, including: Child safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation. Content safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech. Representational harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies. General medical harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across both MedGemma model sizes, the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts. The base Gemma models are pre-trained on a large corpus of text and code data. 
MedGemma 4B utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Its LLM component is trained on a diverse set of medical data, including medical text relevant to radiology images, chest-x rays, histopathology patches, ophthalmology images and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 5 different tasks and 6 medical image modalities. These include both open benchmark datasets and curated datasets, with a focus on expert human evaluations for tasks like CXR report generation and radiology VQA. MedGemma multimodal variants utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Their LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 6 different tasks and 4 medical image modalities. These benchmarks include both open and internal datasets. MedGemma utilizes a combination of public and private datasets.
This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), ChestImaGenome: Set of bounding boxes linking image findings with anatomical regions for MIMIC-CXR (MedGemma 27B multimodal only), SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays). Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next). MIMIC-CXR: MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC). Slake-VQA: The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital. PAD-UFES-20: Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD). SCIN: A collaboration between Google Health and Stanford Medicine. TCGA (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC) CAMELYON: The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands. PMC-OA (PubMed Central Open Access Subset): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH. MedQA: This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits Mendeley Digital Knee X-Ray: This dataset is from Rani Channamma University, and is hosted on Mendeley Data. 
AfriMed-QA: This dataset was developed by multiple collaborating organizations and researchers; key contributors include Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP.
VQA-RAD: This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health).
Chest ImaGenome: IBM Research.
MedExpQA: This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence).
MedXpertQA: This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China).
HealthSearchQA: This dataset consists of 3,173 commonly searched consumer questions.

In addition to the public datasets listed above, MedGemma was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants.

Radiology dataset 1: De-identified dataset of different CT studies across body parts from a US-based radiology outpatient diagnostic center network.
Ophthalmology dataset 1 (EyePACS): De-identified dataset of fundus images from diabetic retinopathy screening.
Dermatology dataset 1: De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia.
Dermatology dataset 2: De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia.
Dermatology dataset 3: De-identified dataset of non-diseased skin images from an internal data collection effort.
Pathology dataset 1: De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
Pathology dataset 2: De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States.
Pathology dataset 3: De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States.
Pathology dataset 4: De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E.
EHR dataset 1: Question/answer dataset drawn from synthetic FHIR records created by Synthea. The test set includes 19 unique patients with 200 questions per patient divided into 10 different categories.

MIMIC-CXR: Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://physionet.org/content/mimic-cxr/2.1.0/ and Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." Scientific Data 6 (1): 1–8.
SLAKE: Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." http://arxiv.org/abs/2102.09542.
PAD-UFES-20: Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." Data in brief 32 (2020): 106221.
SCIN: Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." JAMA Network Open 7 (11): e2446615–e2446615.
TCGA: The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
CAMELYON16: Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." JAMA 318 (22): 2199–2210.
Mendeley Digital Knee X-Ray: Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1
VQA-RAD: Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images." Scientific Data 5 (1): 1–10.
Chest ImaGenome: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/wv01-y230
MedQA: Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." http://arxiv.org/abs/2009.13081.
AfriMed-QA: Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." http://arxiv.org/abs/2411.15640.
MedExpQA: Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. arXiv preprint arXiv:2404.05590. Retrieved from https://arxiv.org/abs/2404.05590
MedXpertQA: Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025.
"MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." http://arxiv.org/abs/2501.18362. Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions. MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in any medical context (image and textual), however the model was pre-trained using chest X-ray, pathology, dermatology, and fundus images. Examples of tasks within MedGemma's training include visual question answering pertaining to medical images, such as radiographs, or providing answers to textual medical questions. Full details of all the tasks MedGemma has been evaluated can be found in the MedGemma Technical Report. Provides strong baseline medical image and text comprehension for models of its size. This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training. This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics. 
MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities on relevant benchmarks, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies. MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images. MedGemma has not been evaluated or optimized for multi-turn applications. MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3\. When adapting MedGemma developer should consider the following: Bias in validation data: As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc). Data contamination concerns: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk. 
- May 20, 2025: Initial Release
- July 9, 2025: Bug Fix: Fixed the subtle degradation in the multimodal performance. The issue was due to a missing end-of-image token in the model vocabulary, impacting combined text-and-image tasks. This fix reinstates and correctly maps that token, ensuring text-only tasks remain unaffected while restoring multimodal performance.
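The code snippets that the usage notes in this card refer to did not survive on this page. As a minimal sketch, and not the card's own example, the following shows how a single-image chat request could be assembled and handed to the Transformers `image-text-to-text` pipeline. The model id `google/medgemma-4b-it`, the helper names, and the prompt text are assumptions made for illustration. The heavy imports are deferred so that the message-building part runs without a GPU or downloaded weights.

```python
from typing import Any

# Assumed Hugging Face id of the instruction-tuned 4B variant.
MODEL_ID = "google/medgemma-4b-it"


def build_messages(image: Any, question: str,
                   system_prompt: str = "You are an expert radiologist.") -> list:
    """Build a chat with one system turn and one user turn holding text plus one image."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image", "image": image},
            ],
        },
    ]


def describe_image(image: Any, question: str, max_new_tokens: int = 200) -> str:
    """Run the chat through the pipeline; needs transformers >= 4.50.0 and a GPU."""
    import torch  # deferred so build_messages works without these packages
    from transformers import pipeline

    pipe = pipeline(
        "image-text-to-text",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    output = pipe(text=build_messages(image, question), max_new_tokens=max_new_tokens)
    # The pipeline returns the whole chat; the last message is the model's reply.
    return output[0]["generated_text"][-1]["content"]


# Inspect the request structure without loading the model.
print(build_messages("<PIL image goes here>", "Describe this X-ray"))
```

The sketch sends exactly one image in a single turn, which matches the card's note that MedGemma has mainly been evaluated on single-image tasks and has not been evaluated for multi-turn use.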

—

7,247

Mistral-Small-Instruct-2409-bnb-4bit

—

7,224

Phi-4-reasoning-plus-GGUF

license:mit

7,173

gemma-2-2b-bnb-4bit

—

7,004

gemma-3n-E4B-it

—

6,979

Meta-Llama-3.1-70B-Instruct-bnb-4bit

Llama-3.3-70B-Instruct

See our collection for all versions of Llama 3.3 including GGUF, 4-bit and original 16-bit formats.

Finetune Llama 3.3, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-Alpaca.ipynb

unsloth/Llama-3.3-70B-Instruct

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Llama 3.3 (text only) | A new mix of publicly available online data. | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.3 model. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License: A custom commercial license, the Llama 3.3 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.3 in applications, please go here.
Intended Use Cases

Llama 3.3 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.3 model also supports the ability to leverage the outputs of its models to improve other models, including synthetic data generation and distillation. The Llama 3.3 Community License allows for these use cases.

Out-of-scope

Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.3 Community License. Use in languages beyond those explicitly referenced as supported in this model card\*.

\*Note: Llama 3.3 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.3 models for languages beyond the 8 supported languages provided they comply with the Llama 3.3 Community License and the Acceptable Use Policy, and in such cases are responsible for ensuring that any use of Llama 3.3 in additional languages is done in a safe and responsible manner.

This repository contains two versions of Llama-3.3-70B-Instruct, for use with transformers and with the original `llama` codebase.

Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.

Llama 3.3 supports multiple tool use formats. You can see a full guide to prompt formatting here. Tool use is also supported through chat templates in Transformers. Here is a quick example showing a single simple tool:

You can then generate text from this input as normal.
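The tool-calling round trip described here can be sketched with plain message dictionaries. This is a minimal illustration, not Meta's reference code: `get_current_temperature` is a hypothetical tool, the tool call is hand-written rather than produced by the model, and the tokenizer and `generate()` steps are only indicated in comments so the sketch runs without downloading the 70B checkpoint.

```python
import json


def get_current_temperature(location: str) -> float:
    """Hypothetical tool: return the current temperature at a location, in Celsius.

    Args:
        location: The location to get the temperature for, e.g. "Paris, France".
    """
    return 22.0  # A real tool would query a weather service here.


messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries."},
    {"role": "user", "content": "What's the temperature in Paris right now?"},
]

# With Transformers, the tool is passed alongside the messages, e.g.
#   tokenizer.apply_chat_template(messages, tools=[get_current_temperature],
#                                 add_generation_prompt=True, return_tensors="pt")
# and the result is fed to model.generate(). That step is omitted here.

# Suppose the model answered with this tool call (hand-written, not model output).
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})

# Run the tool ourselves and append the result with the `tool` role.
result = get_current_temperature(**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})

print(json.dumps(messages[-2:], indent=2))
```

After the `tool` message is appended, the full `messages` list goes through the chat template again so the model can use the tool result in its next reply.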
If the model generates a tool call, you should add it to the chat like so: and then call the tool and append the result, with the `tool` role, like so: After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information, see the LLaMA prompt format docs and the Transformers tool use documentation. The model checkpoints can be used in `8-bit` and `4-bit` for further memory optimisations using `bitsandbytes` and `transformers` To download Original checkpoints, see the example command below leveraging `huggingface-cli`: Training Factors We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure. Training Energy Use Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq. | | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) | | :---- | :---: | :---: | :---: | :---: | | Llama 3.3 70B | 7.0M | 700 | 2,040 | 0 | The methodology used to determine training energy use and greenhouse gas emissions can be found here. 
Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Overview: Llama 3.3 was pretrained on \~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.

Data Freshness: The pretraining data has a cutoff of December 2023.

In this section, we report the results for Llama 3.3 relative to our previous models.

| Category | Benchmark | \# Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama-3.3 70B Instruct | Llama 3.1 405B Instruct |
| :---- | :---- | ----- | :---- | ----- | ----- | ----- | ----- |
| | MMLU (CoT) | 0 | macro_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |
| | MMLU Pro (CoT) | 5 | macro_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |
| Steerability | IFEval | | | 80.4 | 87.5 | 92.1 | 88.6 |
| Reasoning | GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |
| Code | HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |
| | MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |
| Math | MATH (CoT) | 0 | sympy_intersection_score | 51.9 | 68.0 | 77.0 | 73.8 |
| Tool Use | BFCL v2 | 0 | overall_ast_summary/macro_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |
| Multilingual | MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |

Responsibility & Safety

As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks:

- Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama.
- Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm.
- Provide protections for the community to help prevent the misuse of our models.
Responsible deployment Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. Llama 3.3 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Llama 3.3 instruct Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. 
Llama 3.3 systems Large language models, including Llama 3.3, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. New capabilities Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. Multilinguality: Llama 3.3 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. 
We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide. Evaluations We evaluated Llama models for common use cases as well as specific capabilities. Common use cases evaluations measure safety risks of systems for most commonly built applications including chat bot, coding assistant, tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case. Prompt Guard and Code Shield are also available if relevant to the application. Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which were crafted dedicated benchmarks including long context, multilingual, tools calls, coding or memorization. Red teaming For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets. . 
Critical and other risks

We specifically focused our efforts on mitigating the following critical risk areas:

1\. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness

To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of the Llama 3.3 model could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.

2\. Child Safety

Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

3\. Cyber attack enablement

Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
Community Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository. We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here. Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community. Ethical Considerations and Limitations The core values of Llama 3.3 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.3 addresses users and their needs as they are, without insertion unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. But Llama 3.3 is a new technology, and like any new technology, there are risks associated with its use. 
Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.3’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.3 model, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.

Qwen3-30B-A3B-Instruct-2507

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:

- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
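The layer and head counts above determine how much memory the key/value cache needs at long context, which is why shortening the context is the usual remedy for out-of-memory errors. The sketch below is a back-of-the-envelope estimate only: the head dimension (128) and a 16-bit cache are assumptions not stated in this card, and real runtimes add their own overhead.

```python
# Figures taken from the model description above.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
NATIVE_CONTEXT = 262_144

# Assumptions, not stated in this card: head dimension and a 16-bit KV cache.
HEAD_DIM = 128
BYTES_PER_VALUE = 2


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence of num_tokens."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (32_768, NATIVE_CONTEXT):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 2**30:.1f} GiB")
```

Under these assumptions the cache takes about 3 GiB at 32,768 tokens and about 24 GiB at the native 262,144 tokens, on top of the model weights. With 32 KV heads instead of 4 the cache would be eight times larger, which is the saving GQA provides.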
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2\* | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

\*: For reproducibility, we report the win rates evaluated by GPT-4.1.
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the error `KeyError: 'qwen3_moe'`.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint.

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
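The recommended settings can be collected into a single request body for an OpenAI-compatible endpoint. This is a sketch under stated assumptions: the `model` string is a placeholder for whatever name your server registers, `build_request` is a hypothetical helper, and whether `top_k` and `min_p` are accepted as extra fields depends on the serving framework.

```python
import json

# Prompt suffix recommended above for math problems.
MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."


def build_request(question: str, *, math: bool = False, presence_penalty: float = 0.0) -> dict:
    """Build a chat-completions request body using the recommended settings."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    prompt = f"{question}\n{MATH_SUFFIX}" if math else question
    return {
        "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": presence_penalty,
        "max_tokens": 16384,
        # top_k and min_p are not part of the standard OpenAI schema; servers
        # that support them usually accept them as extra fields (an assumption).
        "top_k": 20,
        "min_p": 0.0,
    }


payload = build_request("What is 17 * 24?", math=True)
print(json.dumps(payload, indent=2))
```

Keeping the values in one function makes it easy to raise `presence_penalty` only for prompts that actually loop, since higher values can cost some quality.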
If you find our work helpful, feel free to cite us.

license:apache-2.0

5,291

Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit

license:apache-2.0

5,167

Meta-Llama-3.1-70B-bnb-4bit

llama

5,123

Qwen2.5-Coder-7B-Instruct-128K-GGUF

license:apache-2.0

5,109

Qwen2.5-Omni-3B-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Overview Introduction Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. Omni and Novel Architecture: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio. Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output. Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation. Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B. Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K. We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. 
Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness). OmniBench Speech | Sound Event | Music | Avg Gemini-1.5-Pro 42.67%|42.26%|46.23%|42.91% Librispeech dev-clean | dev other | test-clean | test-other SALMONN -|-|2.1|4.9 Common Voice 15 en | zh | yue | fr Whisper-large-v3 9.3|12.8|10.9|10.8 Wenetspeech test-net | test-meeting Seed-ASR-Chinese 4.7|5.7 CoVoST2 en-de | de-en | en-zh | zh-en SALMONN 18.6|-|33.1|- MusicCaps LP-MusicCaps 0.291|0.149|0.089| 0.061 |0.129|0.130 Qwen2.5-Omni-3B 0.325| 0.163 | 0.093 |0.057| 0.132 | 0.229 Qwen2.5-Omni-7B 0.328 |0.162|0.090|0.055|0.127|0.225 MMAU Sound | Music | Speech | Avg Gemini-Pro-V1.5 56.75|49.40|58.55|54.90 VoiceBench AlpacaEval | CommonEval | SD-QA | MMSU Ultravox-v0.4.1-LLaMA-3.1-8B 4.55 |3.90|53.35|47.17 VoiceBench OpenBookQA | IFEval | AdvBench | Avg Ultravox-v0.4.1-LLaMA-3.1-8B 65.27| 66.88 |98.46|71.45 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |--------------------------------|--------------|------------|------------|---------------|-------------| | MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 | | MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 | | MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 | | MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - | | MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 | | MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 | | MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 | | MME sum | 2340 | 2117 | 2372 | 2347 | 2003 | | MuirBench | 59.2 | 48.0 | - | 59.2 | - | | CRPE relation | 76.5 | 73.7 | - | 76.4 | - | | RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - | | MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - | | MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - | | AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - | | TextVQA val | 
84.4 | 79.8 | 83.2 | 84.9 | - | | DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - | | ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - | | OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro | |--------------------------|--------------|---------------|---------------|----------------|----------------| | Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 | | Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 | | Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 | | Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 | | Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 | | Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 | | Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 | | Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 | | ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 | | PointGrounding | 66.5 | 46.2 | 67.3 | - | - | | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | |-----------------------------|--------------|------------|------------|---------------|-------------| | Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 | | Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - | | MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - | | EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - | SEED test-zh | test-en | test-hard Seed-TTSICL 1.11 | 2.24 | 7.58 SEED test-zh | test-en | test-hard Seed-TTSICL 0.796 | 0.762 | 0.776 | Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B | |-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------| | MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 | | MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 | | LiveBench 0831 | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 | | GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 | | MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 | | GSM8K | 
88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 | | HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 | | MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 | | MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 | | LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |

Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The code of Qwen2.5-Omni is included in the latest Hugging Face `transformers`, and we advise you to build from source with the command:

We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it using the following command, and make sure your system has `ffmpeg` installed:

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to using torchvision for video processing. You can still install decord from source to have it used when loading video.

Here is a code snippet showing how to use the chat model with `transformers` and `qwen_omni_utils`:

| Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video |
|--------------|-----------|-------------|-------------|-------------|
| Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommended | Not Recommended |
| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
| Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommended | Not Recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`, and `BF16` is tested with `attn_implementation="flash_attention_2"`; in practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource here.

Video URL compatibility largely depends on the third-party library version.
The details are in the table below. Change the backend by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision

The model can batch inputs composed of mixed samples of various types such as text, images, audio and videos when `return_audio=False` is set. Here is an example.

Prompt for audio output

If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise the audio output may not work as expected.

Use audio in video

In the process of multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content in the video, or sounds generated by certain events in the video). This information helps the model provide a better interactive experience, so we provide the following options for users to decide whether to use audio in video. Note that during a multi-round conversation, the `use_audio_in_video` parameter in these places must be set to the same value; otherwise unexpected results will occur.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option saves about `~2GB` of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model returns only text outputs, which gives faster text responses.

Change voice type of output audio

Qwen2.5-Omni supports the ability to change the voice of the output audio.
The `"Qwen/Qwen2.5-Omni-3B"` checkpoint supports two voice types as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the default voice type is `Chelsie`.

First, make sure to install the latest version of Flash Attention 2:

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
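The audio-output requirements described in this card (the fixed system prompt, a valid voice type, and a consistent `use_audio_in_video` value) can be gathered into one small helper. This is a minimal sketch: `build_audio_chat` is a hypothetical function, the nested `content` layout is assumed to match what the Qwen2.5-Omni processor expects, and nothing here loads the model.

```python
# System prompt quoted from this card; required when audio output is wanted.
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)
VOICES = ("Chelsie", "Ethan")  # voice types listed in the table above


def build_audio_chat(user_text: str, speaker: str = "Chelsie", use_audio_in_video: bool = True):
    """Return (conversation, generate_kwargs) for a request that expects audio output."""
    if speaker not in VOICES:
        raise ValueError(f"unknown voice {speaker!r}; expected one of {VOICES}")
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}]},
        {"role": "user", "content": [{"type": "text", "text": user_text}]},
    ]
    # The same use_audio_in_video value has to be reused wherever the option
    # appears in a multi-round conversation, so it is kept in one place.
    generate_kwargs = {
        "speaker": speaker,
        "return_audio": True,
        "use_audio_in_video": use_audio_in_video,
    }
    return conversation, generate_kwargs


conversation, generate_kwargs = build_audio_chat("Describe this clip.", speaker="Ethan")
print(conversation[0]["content"][0]["text"])
print(generate_kwargs)
```

The returned keyword arguments are meant to be passed to `generate` unchanged on every turn, which avoids the mismatched `use_audio_in_video` settings the card warns about.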

—

5,014

Qwen2.5-7B-unsloth-bnb-4bit

license:apache-2.0

4,953

Llama-3.2-3B-unsloth-bnb-4bit

llama

4,906

embeddinggemma-300m-GGUF

—

4,854

Qwen3-VL-235B-A22B-Instruct-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-235B-A22B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

Mistral-Small-24B-Instruct-2501-GGUF

license:apache-2.0

4,312

Qwen3-VL-32B-Instruct-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

license:apache-2.0

4,250

Qwen2.5-VL-3B-Instruct-bnb-4bit

—

4,229

Magistral-Small-2506-GGUF

license:apache-2.0

4,227

Qwen3-VL-235B-A22B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-235B-A22B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

license:apache-2.0

4,227

gemma-3-27b-it-qat-GGUF

—

4,213

Qwen3-VL-4B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-4B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

Phi-4-mini-instruct-bnb-4bit

license:mit

3,024

Qwen3-30B-A3B

—

3,005

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly - Read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Tiny is a 7B parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of permissively licensed open source instruction datasets and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: The following describes how to use the Granite-4.0-H-Tiny model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

Benchmarks (columns: Metric, Micro Dense, H Micro Dense, H Tiny MoE, H Small MoE)

Multilingual benchmarks and their included languages:

- MMMLU (11 languages): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14 languages): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Tiny baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|-------|-------------|---------------|------------|-------------|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data largely comprises three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
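The tool-calling section of this card asks for tools to be defined in OpenAI's function definition schema. As a rough illustration of what such a definition looks like, here is a minimal sketch. The tool name `get_current_weather`, its parameters, and the `check_tool` helper are all hypothetical and not part of the Granite card.

```python
import json

# Hypothetical tool, written in the OpenAI function-definition layout:
# a "function" entry with a name, a description, and JSON Schema parameters.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "Name of the city."},
            },
            "required": ["city"],
        },
    },
}


def check_tool(tool):
    """Return a list of problems found in a tool definition (empty if none)."""
    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")
    fn = tool.get("function", {})
    if not fn.get("name"):
        problems.append("function.name is missing")
    params = fn.get("parameters", {})
    if params.get("type") != "object":
        problems.append("parameters.type must be 'object'")
    # Every required argument should also be declared under "properties".
    for key in params.get("required", []):
        if key not in params.get("properties", {}):
            problems.append(f"required parameter {key!r} is not declared")
    return problems


tools = [weather_tool]
print(json.dumps(tools, indent=2))
print(check_tool(weather_tool))
```

With 🤗 Transformers, a list like `tools` is typically passed to the tokenizer's chat template through its `tools` argument. Whether the Granite template expects exactly this layout should be checked against the original model card.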

license:apache-2.0

2,826

Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit

llama4

2,821

medgemma-27b-it-GGUF

—

2,820

Phi-4-mini-instruct

license:mit

2,816

Seed-Coder-8B-Reasoning-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Introduction

We are thrilled to introduce Seed-Coder, a powerful, transparent, and parameter-efficient family of open-source code models at the 8B scale, featuring base, instruct, and reasoning variants. Seed-Coder helps advance open code models through the following highlights.

- Model-centric: Seed-Coder predominantly leverages LLMs instead of hand-crafted rules for code data filtering, minimizing manual effort in pretraining data construction.
- Transparent: We openly share detailed insights into our model-centric data pipeline, including methods for curating GitHub data, commits data, and code-related web data.
- Powerful: Seed-Coder achieves state-of-the-art performance among open-source models of comparable size across a diverse range of coding tasks.

This repo contains the Seed-Coder-8B-Reasoning model, which has the following features:

- Type: Causal language models
- Training Stage: Pretraining & Post-training
- Data Source: Public datasets
- Context Length: 65,536

Model Downloads

| Model Name | Length | Download | Notes |
|------------|--------|----------|-------|
| Seed-Coder-8B-Base | 32K | 🤗 Model | Pretrained on our model-centric code data. |
| Seed-Coder-8B-Instruct | 32K | 🤗 Model | Instruction-tuned for alignment with user intent. |
| 👉 Seed-Coder-8B-Reasoning | 64K | 🤗 Model | RL trained to boost reasoning capabilities. |
| Seed-Coder-8B-Reasoning-bf16 | 64K | 🤗 Model | RL trained to boost reasoning capabilities. |

Requirements

You will need to install the latest versions of `transformers` and `accelerate`. The model can then be loaded and used for code generation through the Hugging Face `pipeline` API.

Evaluation

Seed-Coder-8B-Reasoning delivers impressive performance on competitive programming, demonstrating that smaller LLMs can also be competent on complex reasoning tasks. Our model surpasses QwQ-32B and DeepSeek-R1 on IOI'2024, and achieves an Elo rating comparable to o1-mini on Codeforces contests. For detailed benchmark performance, please refer to our 📑 Technical Report.

This project is licensed under the MIT License. See the LICENSE file for details.
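The card describes loading the model through the Hugging Face `pipeline` API. A minimal sketch of that workflow might look as follows. The repository id `ByteDance-Seed/Seed-Coder-8B-Reasoning`, the helper names, and the `max_new_tokens` value are assumptions for illustration, not the authors' reference snippet. The heavy import sits inside `generate` so that the prompt-building part can be run without downloading any weights.

```python
# Assumed upstream repository id; the GGUF files in this repo are meant for
# llama.cpp-style runtimes, not for the transformers pipeline.
MODEL_ID = "ByteDance-Seed/Seed-Coder-8B-Reasoning"


def build_messages(prompt):
    """Wrap a coding request in the chat-message format accepted by text-generation pipelines."""
    return [{"role": "user", "content": prompt}]


def generate(prompt, max_new_tokens=1024):
    """Load the model via `pipeline` and return the assistant's reply text.

    Requires `transformers` and `accelerate`; downloads the weights on first use.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    pipe = pipeline("text-generation", model=MODEL_ID, device_map="auto")
    outputs = pipe(build_messages(prompt), max_new_tokens=max_new_tokens)
    # For chat-style input the pipeline returns the whole conversation;
    # the last message is the newly generated assistant turn.
    return outputs[0]["generated_text"][-1]["content"]


messages = build_messages("Write a quick sort algorithm.")
print(messages)
```

Calling `generate("Write a quick sort algorithm.")` performs the actual generation. A reasoning model can produce long outputs, so `max_new_tokens` may need to be raised well beyond the value assumed here.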

llama

2,766

gemma-2-9b-it

—

2,685

GLM-4.5-GGUF

📍 Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China). The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. In our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves a score of 63.2, ranking 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more evaluation results, showcases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser, and reasoning parser can be found in the implementations of transformers, vLLM and SGLang.

license:mit

2,667

Nanonets-OCR-s

—

2,663

Qwen2.5-Coder-14B-Instruct-bnb-4bit

gemma-3-270m

—

1,980

Qwen2.5-Coder-1.5B-Instruct

license:apache-2.0

1,967

Qwen3-VL-4B-Thinking-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-4B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

Mistral-Nemo-Instruct-2407

license:apache-2.0

1,795

codellama-7b-bnb-4bit

Qwen3-VL-4B-Thinking-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-4B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

license:apache-2.0

1,619

Qwen3-4B-Thinking-2507-bnb-4bit

license:apache-2.0

1,617

Qwen3-30B-A3B-Base

license:apache-2.0

1,613

Qwen3-VL-2B-Instruct-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-2B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The Qwen3-VL code has been merged into the latest Hugging Face `transformers`, and we advise you to build `transformers` from source. If you find our work helpful, feel free to cite us.

license:apache-2.0

1,601

Hunyuan-A13B-Instruct-GGUF

Mistral-Nemo-Base-2407

license:apache-2.0

1,051

dots.llm1.inst-GGUF

license:mit

1,022

Seed-Coder-8B-Instruct-GGUF

llama

1,016

Qwen2.5-Math-1.5B-Instruct-bnb-4bit

license:apache-2.0

1,003

LFM2-1.2B-GGUF

license:apache-2.0

857

Hermes-4-405B-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Hermes 4 405B is a frontier, hybrid-mode reasoning model based on Llama-3.1-405B by Nous Research that is aligned to you. Read the Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: Massively increased dataset size from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
- Reasoning that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements on steerability, especially on reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses the Llama-3-Chat format with role headers and special tags. Reasoning mode can be activated with the chat template via the flag `thinking=True` or through a dedicated system prompt. You can add any additional system instructions before or after that system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one. Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, interleaved with its reasoning. You may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags, for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming. There are also automatic tool parsers built into vLLM and SGLang for Hermes: set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: Use the Llama chat format for Hermes 4 70B and 405B, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching. Hermes 4 is available as BF16 original weights as well as FP8 variants and GGUF variants by LM Studio. FP8: https://huggingface.co/NousResearch/Hermes-4-405B-FP8 GGUF (Courtesy of LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-405B-GGUF Hermes 4 is also available in smaller sizes (e.g., 70B and 14B) with similar prompt formats.
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
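
For cases where a serving framework's built-in parser is not available, tool calls can be pulled out of raw model output by hand. The following is a minimal sketch, not the Hermes reference parser. It assumes the tags are the literal strings `<tool_call>` and `</tool_call>` and that the payload between them is a single JSON object; the `extract_tool_calls` helper and the `get_weather` tool are hypothetical names used only for illustration.

```python
import json
import re
from typing import Any, Dict, List

# Assumption: each call is wrapped in literal <tool_call> ... </tool_call> tags
# and the text between the tags is one JSON object.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed JSON tool call found in `text`."""
    calls: List[Dict[str, Any]] = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the whole turn
        if isinstance(parsed, dict):
            calls.append(parsed)
    return calls


# Hypothetical model output containing one call to a made-up tool.
sample = (
    "Let me check the weather.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
calls = extract_tool_calls(sample)
print(calls)
```

Skipping malformed payloads rather than raising keeps the rest of the turn usable. A streaming client would more likely buffer text until the closing tag arrives before attempting to decode it.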

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

Qwen3-Next-80B-A3B is the first installment in the Qwen3-Next series and features the following key enhancements:

- Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling for ultra-long context length.
- High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
- Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.

We are seeing strong performance in terms of both parameter efficiency and inference speed for Qwen3-Next-80B-A3B:

- Qwen3-Next-80B-A3B-Base outperforms Qwen3-32B-Base on downstream tasks with 10% of the total training cost and with 10 times inference throughput for context over 32K tokens.
- Qwen3-Next-80B-A3B-Instruct performs on par with Qwen3-235B-A22B-Instruct-2507 on certain benchmarks, while demonstrating significant advantages in handling ultra-long-context tasks up to 256K tokens.

For more details, please refer to our blog post Qwen3-Next.

> [!Note]
> Qwen3-Next-80B-A3B-Instruct supports only instruct (non-thinking) mode and does not generate `<think></think>` blocks in its output.
Qwen3-Next-80B-A3B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Parameters (Non-Embedding): 79B
- Hidden Dimension: 2048
- Number of Layers: 48
- Hybrid Layout: `12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))`
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens

| | Qwen3-30B-A3B-Instruct-2507 | Qwen3-32B Non-Thinking | Qwen3-235B-A22B-Instruct-2507 | Qwen3-Next-80B-A3B-Instruct |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 78.4 | 71.9 | 83.0 | 80.6 |
| MMLU-Redux | 89.3 | 85.7 | 93.1 | 90.9 |
| GPQA | 70.4 | 54.6 | 77.5 | 72.9 |
| SuperGPQA | 53.4 | 43.2 | 62.6 | 58.8 |
| Reasoning | | | | |
| AIME25 | 61.3 | 20.2 | 70.3 | 69.5 |
| HMMT25 | 43.0 | 9.8 | 55.4 | 54.1 |
| LiveBench 20241125 | 69.0 | 59.8 | 75.4 | 75.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 43.2 | 29.1 | 51.8 | 56.6 |
| MultiPL-E | 83.8 | 76.9 | 87.9 | 87.8 |
| Aider-Polyglot | 35.6 | 40.0 | 57.3 | 49.8 |
| Alignment | | | | |
| IFEval | 84.7 | 83.2 | 88.7 | 87.6 |
| Arena-Hard v2 | 69.0 | 34.1 | 79.2 | 82.7 |
| Creative Writing v3 | 86.0 | 78.3 | 87.5 | 85.3 |
| WritingBench | 85.5 | 75.4 | 85.2 | 87.3 |
| Agent | | | | |
| BFCL-v3 | 65.1 | 63.0 | 70.9 | 70.3 |
| TAU1-Retail | 59.1 | 40.1 | 71.3 | 60.9 |
| TAU1-Airline | 40.0 | 17.0 | 44.0 | 44.0 |
| TAU2-Retail | 57.0 | 48.8 | 74.6 | 57.3 |
| TAU2-Airline | 38.0 | 24.0 | 50.0 | 45.5 |
| TAU2-Telecom | 12.3 | 24.6 | 32.5 | 13.2 |
| Multilingualism | | | | |
| MultiIF | 67.9 | 70.7 | 77.5 | 75.8 |
| MMLU-ProX | 72.0 | 69.3 | 79.4 | 76.7 |
| INCLUDE | 71.9 | 70.9 | 79.5 | 78.9 |
| PolyMATH | 43.1 | 22.5 | 50.2 | 45.9 |

For reproducibility, we report the win rates evaluated by GPT-4.1.

The code for Qwen3-Next has been merged into the main branch of Hugging Face `transformers`.

With earlier versions, you will encounter the following error:

The following contains a code snippet illustrating how to use the model to generate content based on given inputs.

> [!Note]
> Multi-Token Prediction (MTP) is not generally available in Hugging Face Transformers.

> [!Note]
> The efficiency or throughput improvement depends highly on the implementation.
> It is recommended to adopt a dedicated inference framework, e.g., SGLang and vLLM, for inference tasks.

> [!Tip]
> Depending on the inference settings, you may observe better efficiency with `flash-linear-attention` and `causal-conv1d`.
> See the links for detailed instructions and requirements.

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

SGLang is a fast serving framework for large language models and vision language models. SGLang can be used to launch a server with an OpenAI-compatible API service.

`sglang>=0.5.2` is required for Qwen3-Next, which can be installed using:

The following command can be used to create an API endpoint at `http://localhost:30000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.

The following command is recommended for MTP, with the remaining settings the same as above:

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to SGLang's usage guide on Qwen3-Next.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM can be used to launch a server with an OpenAI-compatible API service.
`vllm>=0.10.2` is required for Qwen3-Next, which can be installed using:

The following command can be used to create an API endpoint at `http://localhost:8000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.

The following command is recommended for MTP, with the remaining settings the same as above:

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to vLLM's usage guide on Qwen3-Next.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.

Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`. In general, there are two approaches to enabling YaRN for supported frameworks:

- Modifying the model files: In the `config.json` file, add the `rope_scaling` fields:

> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set `factor` as 2.0.

We test the model on a 1M version of the RULER benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-235B-A22B-Instruct-2507 | 92.5 | 98.5 | 97.6 | 96.9 | 97.3 | 95.8 | 94.9 | 93.9 | 94.5 | 91.0 | 92.2 | 90.9 | 87.8 | 84.8 | 86.5 | 84.5 |
| Qwen3-Next-80B-A3B-Instruct | 91.8 | 98.5 | 99.0 | 98.0 | 98.7 | 97.6 | 95.0 | 96.0 | 94.0 | 93.5 | 91.7 | 86.9 | 85.5 | 81.7 | 80.3 | 80.3 |

Qwen3-Next is evaluated with YaRN enabled. Qwen3-2507 models are evaluated with Dual Chunk Attention enabled. Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to cite us.
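
The YaRN `factor` guidance above (524,288 tokens against a native 262,144 gives 2.0) amounts to dividing the target context length by the native one. The sketch below shows that arithmetic and builds a candidate `rope_scaling` entry. The helper name `yarn_rope_scaling` is hypothetical, and the field names `rope_type` and `original_max_position_embeddings` are assumptions based on the convention commonly used in Hugging Face `config.json` files; check them against the framework version in use.

```python
import json
from typing import Any, Dict, Optional

NATIVE_CONTEXT = 262_144  # native context length stated in the model card


def yarn_rope_scaling(
    target_context: int, native_context: int = NATIVE_CONTEXT
) -> Optional[Dict[str, Any]]:
    """Build a rope_scaling entry for a target context, or None if not needed."""
    if target_context <= native_context:
        # Static YaRN can hurt short-context quality, so add nothing here.
        return None
    return {
        "rope_type": "yarn",  # assumed field name
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,  # assumed field name
    }


scaling = yarn_rope_scaling(524_288)
print(json.dumps({"rope_scaling": scaling}, indent=2))
```

Returning `None` for targets within the native window mirrors the advice to add the configuration only when long contexts are actually required. The exact ratio is kept as-is here; rounding it up to a convenient value is a separate choice left to the user.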

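The hybrid layout in the specification list above reads as 12 repeated groups, each made of three Gated DeltaNet layers followed by one Gated Attention layer, with every layer followed by an MoE block. The following sketch expands that pattern into a flat list to confirm it accounts for the stated 48 layers. The function name `hybrid_layout` and the string labels are illustrative only and are not taken from the model code.

```python
from typing import List


def hybrid_layout(
    groups: int = 12, linear_per_group: int = 3, full_per_group: int = 1
) -> List[str]:
    """Expand the repeating pattern into one token-mixer label per layer."""
    layers: List[str] = []
    for _ in range(groups):
        # Linear-attention layers come first in each group ...
        layers.extend(["gated_deltanet"] * linear_per_group)
        # ... followed by the full (gated) attention layer.
        layers.extend(["gated_attention"] * full_per_group)
    return layers


layers = hybrid_layout()
print(len(layers))                      # total number of layers
print(layers.count("gated_attention"))  # layers using Gated Attention
print(layers[:4])                       # the first repeating group
```

Under this reading, a quarter of the layers use Gated Attention and the rest use Gated DeltaNet.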

license:apache-2.0

763

Apertus-8B-Instruct-2509-unsloth-bnb-4bit

1. Model Summary
2. How to use
3. Evaluation
4. Training
5. Limitations
6. Legal Aspects

Apertus is a 70B and 8B parameter language model designed to push the boundaries of fully-open multilingual and transparent models. The model supports over 1000 languages and long context, uses only fully compliant and open training data, and achieves comparable performance to models trained behind closed doors.

The model is a decoder-only transformer, pretrained on 15T tokens with a staged curriculum of web, code and math data. The model uses a new xIELU activation function and is trained from scratch with the AdEMAMix optimizer. Post-training included supervised fine-tuning and alignment via QRPO.

Key features

- Fully open model: open weights + open data + full training details including all data and training recipes
- Massively Multilingual: 1811 natively supported languages
- Compliant: Apertus is trained while respecting opt-out consent of data owners (even retrospectively), and avoiding memorization of training data

The modeling code for Apertus is available in transformers `v4.56.0` and later, so make sure to upgrade your transformers version. You can also load the model with the latest `vLLM`, which uses transformers as a backend.

>[!TIP]
> We recommend setting `temperature=0.8` and `top_p=0.9` in the sampling parameters.

Apertus by default supports a context length up to 65,536 tokens.

Deployment of the models is directly supported by the newest versions of Transformers, vLLM, SGLang, and also for running on-device with MLX.

Pretraining Evaluation: Performance (%) of Apertus models on general language understanding tasks (higher is better) compared to other pretrained models.
| Model | Avg | ARC | HellaSwag | WinoGrande | XNLI | XCOPA | PIQA |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fully Open Models | | | | | | | |
| Apertus-8B | 65.8 | 72.7 | 59.8 | 70.6 | 45.2 | 66.5 | 79.8 |
| Apertus-70B | 67.5 | 70.6 | 64.0 | 73.3 | 45.3 | 69.8 | 81.9 |
| OLMo2-7B | 64.0 | 72.9 | 60.4 | 74.5 | 40.4 | 55.2 | 80.9 |
| OLMo2-32B | 67.7 | 76.2 | 66.7 | 78.6 | 42.9 | 60.1 | 82.1 |
| EuroLLM-1.7B | 54.8 | 57.2 | 44.9 | 58.1 | 40.7 | 55.7 | 72.4 |
| EuroLLM-9B | 62.8 | 67.9 | 57.9 | 68.8 | 41.5 | 61.1 | 79.6 |
| SmolLM2-1.7B | 58.5 | 66.1 | 52.4 | 65.6 | 37.6 | 52.3 | 77.0 |
| SmolLM3-3B | 61.6 | 68.6 | 56.4 | 68.1 | 40.5 | 58.2 | 77.7 |
| Poro-34B | 61.7 | 65.7 | 57.9 | 70.6 | 41.6 | 56.0 | 78.5 |
| Open-Weight Models | | | | | | | |
| Llama3.1-8B | 65.4 | 71.6 | 60.0 | 73.4 | 45.3 | 61.8 | 80.1 |
| Llama3.1-70B | 67.3 | 74.4 | 56.5 | 79.4 | 44.3 | 66.7 | 82.3 |
| Qwen2.5-7B | 64.4 | 69.6 | 60.1 | 72.8 | 43.3 | 61.7 | 78.7 |
| Qwen2.5-72B | 69.8 | 76.2 | 67.5 | 78.0 | 46.9 | 68.2 | 82.0 |
| Qwen3-32B | 67.8 | 75.6 | 64.0 | 73.8 | 44.4 | 67.9 | 80.9 |
| Llama4-Scout-16x17B | 67.9 | 74.7 | 66.8 | 73.2 | 43.5 | 67.7 | 81.2 |
| GPT-OSS-20B | 58.1 | 67.0 | 41.5 | 66.5 | 37.4 | 60.4 | 75.6 |

Many additional benchmark evaluations, for pretraining and post-training phases, multilingual evaluations in around a hundred languages, and long context evaluations are provided in Section 5 of the ApertusTechReport.pdf

- Architecture: Transformer decoder
- Pretraining tokens: 15T
- Precision: bfloat16
- GPUs: 4096 GH200
- Training Framework: Megatron-LM
- ...
Open resources

All elements used in the training process are made openly available

- Training data reconstruction scripts: github.com/swiss-ai/pretrain-data
- The training intermediate checkpoints are available on the different branches of this same repository

Apertus can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

EU AI Act Transparency Documentation and Code of Practice

- ApertusEUPublicSummary.pdf
- ApertusEUCodeofPractice.pdf

Data Protection and Copyright Requests

For removal requests of personally identifiable information (PII) or of copyrighted content, please contact the respective dataset owners or us directly

- [email protected]
- [email protected]

Output Filter for PII

- Currently no output filter is provided.
- Please check this site regularly for an output filter that can be used on top of the Apertus LLM. The filter reflects data protection deletion requests which have been addressed to us as the developer of the Apertus LLM. It allows you to remove Personal Data contained in the model output. We strongly advise downloading and applying this output filter from this site every six months.

Contact

To contact us, please send an email to [email protected]

license:apache-2.0

762

gemma-7b-it

license:apache-2.0

757

Phi-4-reasoning-plus-unsloth-bnb-4bit

license:mit

757

gemma-2-it-GGUF

—

751

Qwen3-8B-Base-bnb-4bit

unsloth

Meta-Llama-3.1-8B-Instruct-bnb-4bit

gpt-oss-20b-BF16

mistral-7b-v0.3-bnb-4bit

Meta-Llama-3.1-8B-Instruct

Qwen3-8B-bnb-4bit

Llama-3.2-1B-Instruct

Llama-3.2-3B-Instruct

DeepSeek-R1-Distill-Qwen-32B-bnb-4bit

gpt-oss-20b-GGUF

Qwen3-30B-A3B-GGUF

gemma-3-1b-it

llava-1.5-7b-hf-bnb-4bit

gpt-oss-20b-unsloth-bnb-4bit

Qwen3-Next-80B-A3B-Instruct-bnb-4bit

Qwen3-14B-GGUF

gpt-oss-120b-GGUF

gemma-2-9b-it-bnb-4bit

Qwen3-Coder-30B-A3B-Instruct-GGUF

Qwen3-4B-Instruct-2507-unsloth-bnb-4bit

Qwen3-8B-unsloth-bnb-4bit

Qwen2.5-0.5B-Instruct-unsloth-bnb-4bit

DeepSeek-R1-0528-Qwen3-8B-GGUF

Qwen3-0.6B-unsloth-bnb-4bit

Qwen2.5-VL-7B-Instruct-GGUF

Qwen2.5-7B-Instruct

llama-3-8b-Instruct-bnb-4bit

tinyllama-chat-bnb-4bit

meta-Llama-3.1-8B-unsloth-bnb-4bit

Llama-3.2-1B-Instruct-unsloth-bnb-4bit

gpt-oss-120b

gemma-3-4b-it-GGUF

GLM-4.6-GGUF

mistral-7b-instruct-v0.3-bnb-4bit

Llama-3.2-3B-Instruct-unsloth-bnb-4bit

Qwen3-1.7B-unsloth-bnb-4bit

Phi-3-mini-4k-instruct-bnb-4bit

MiniMax-M2-GGUF

gemma-3-4b-it-unsloth-bnb-4bit

granite-4.0-h-small-GGUF

Qwen3-30B-A3B-Instruct-2507-GGUF

Qwen3-VL-8B-Thinking-1M-GGUF

Qwen3-1.7B-GGUF

Llama-3.1-8B-Instruct-unsloth-bnb-4bit

Qwen3-4B-Instruct-2507-GGUF

GLM-4.5-Air-GGUF

gemma-3-12b-it-unsloth-bnb-4bit

Mistral-Nemo-Instruct-2407-bnb-4bit

Qwen3-VL-8B-Instruct-GGUF

gemma-3-4b-it

Qwen3-VL-30B-A3B-Instruct-GGUF

Mistral-Small-24B-Instruct-2501

gemma-3-27b-it-GGUF

Mistral-Small-3.1-24B-Instruct-2503

gemma-3-270m-it

phi-4-unsloth-bnb-4bit

Qwen2.5-14B-Instruct

Qwen3-VL-8B-Instruct-unsloth-bnb-4bit

gemma-3-12b-it-GGUF

Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit

Qwen3-14B-unsloth-bnb-4bit

DeepSeek-V3.1-Terminus-BF16

Qwen3-4B-unsloth-bnb-4bit

llama-3-8b-bnb-4bit

gemma-3-1b-it-GGUF

gemma-3n-E4B-it-GGUF

Qwen2.5-1.5B-unsloth-bnb-4bit

DeepSeek-R1-Distill-Llama-8B-GGUF

Mistral-Small-3.2-24B-Instruct-2506-GGUF

Qwen3-4B-Instruct-2507

Devstral-Small-2507-GGUF

Qwen3-14B

Kimi-K2-Instruct-GGUF

gemma-3-27b-it

mistral-7b-bnb-4bit

Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit

Llama-4-Scout-17B-16E-Instruct-GGUF

Qwen3-Coder-30B-A3B-Instruct

Qwen3-0.6B-GGUF

Qwen2.5-3B-Instruct-unsloth-bnb-4bit