aquif-ai

25 models

aquif-3.5-Max-42B-A3B

license:apache-2.0
71,582
78

Reupload-Magic-Wan-Image-v1.0

license:apache-2.0
1,964
71

aquif-3.5-Plus-30B-A3B

license:apache-2.0
561
26

aquif-3.5-8B-Think

The aquif-3.5 series is the successor to aquif-3, featuring a simplified naming scheme, expanded Mixture of Experts (MoE) options, and across-the-board performance improvements. This release streamlines model selection while delivering enhanced capabilities across reasoning, multilingual support, and general intelligence tasks.

| Model | HuggingFace Repository |
|-------|------------------------|
| aquif-3.5-A0.6B-Preview | aquiffoo/aquif-3.5-A0.6B-Preview |
| aquif-3.5-3B | aquiffoo/aquif-3.5-3B |
| aquif-3.5-7B | aquiffoo/aquif-3.5-7B |
| aquif-3.5-8B-Think | aquiffoo/aquif-3.5-8B-Think |
| aquif-3.5-A4B-Think | aquiffoo/aquif-3.5-A4B-Think |

| Model | Size (B) | Active Params (B) | Reasoning | MoE | Multilingual | MMLU | Context Window |
|-------|----------|-------------------|-----------|-----|--------------|------|----------------|
| aquif-3.5-A0.6B | 2.61 | 0.6 | ❌ | ✅ | ✅ | 60.5% | 4k |
| aquif-3.5-3B | 2.67 | 2.67 | ❌ | ❌ | ✅ | 70.2% | 32k |
| aquif-3.5-7B | 7.3 | 7.3 | ❌ | ❌ | ✅ | 78.5% | 16k |
| aquif-3.5-8B-Think | 8.2 | 8.2 | ✅ | ❌ | ✅ | 81.1% | 40k |
| aquif-3.5-A4B-Think | 12 | 4 | ✅ | ✅ | ✅ | 86.9% | 128k |

aquif-3.5-A0.6B: an experimental small-scale Mixture of Experts model designed for multilingual applications with minimal computational overhead. Despite its compact active parameter count, it demonstrates competitive performance against larger dense models.

| Metric | aquif-3.5 (2.6B A0.6B) | Qwen3 (0.8B) | LFM2 (0.7B) | aquif-3 (0.4B) |
|--------|------------------------|--------------|-------------|----------------|
| MMLU | 60.5 | 44.9 | 49.9 | 55.6 |
| GPQA | 30.2 | 22.1 | 28.5 | 28.5 |
| GSM8K | 50.7 | 36.5 | 46.4 | 52.1 |
| HumanEval | 45.2 | 36.0 | 40.0 | 37.4 |
| Average | 46.7 | 34.9 | 41.2 | 43.4 |

aquif-3.5-3B: the new standard for small dense models, offering optimal performance-per-parameter efficiency for general-purpose applications.

| Metric | aquif-3.5 (2.7B) | EXAONE 3.5 (2.4B) | Qwen3 (4B) | Gemma 3 (4B) | Phi-4-mini (3.8B) | Apriel-5B-Instruct (4.8B) | aquif-3 (3.2B) |
|--------|------------------|-------------------|------------|--------------|-------------------|---------------------------|----------------|
| MMLU (General Knowledge) | 70.2 | 60.4 | 70.4 | 59.6 | 67.3 | 64.6 | 67.5 |
| GPQA Diamond (Science) | 35.8 | 28.4 | 39.3 | 30.9 | 25.2 | 28.4 | 36.1 |
| LiveCodeBench (Coding) | 23.1 | 12.5 | 21.3 | 11.2 | 10.4 | 11.6 | 15.4 |
| IFEval (Instruction Following) | 78.9 | 73.6 | 71.2 | 80.2 | 68.6 | 80.8 | 78.9 |
| AIME 2025 (Competition Math) | 13.4 | 4.5 | 9.8 | 12.7 | 5.3 | 4.3 | 9.6 |
| Average | 44.3 | 35.9 | 42.4 | 38.9 | 35.4 | 37.9 | 41.5 |

aquif-3.5-7B: a Qwen-based architecture optimized for multilingual applications with extended context capabilities, delivering state-of-the-art performance in its size class.

| Metric | aquif-3.5 (7.3B) | EXAONE 3.5 (7.8B) | Qwen3 (8.2B) | Gemma 3 (12B) | Llama 3.1 (8B) | Kanana 1.5 (8B) | aquif-3 (3.2B) |
|--------|------------------|-------------------|--------------|---------------|----------------|-----------------|----------------|
| MMLU (General Knowledge) | 78.5 | 72.2 | 82.9 | 74.5 | 69.2 | 68.8 | 67.5 |
| GPQA Diamond (Science) | 42.3 | 39.4 | 39.3 | 40.9 | 32.8 | 37.5 | 36.1 |
| LiveCodeBench (Coding) | 21.3 | 18.0 | 23.9 | 13.7 | 10.8 | 16.5 | 15.4 |
| IFEval (Instruction Following) | 85.6 | 82.6 | 85.4 | 80.2 | 75.0 | 80.1 | 78.9 |
| AIME 2025 (Competition Math) | 23.4 | 18.3 | 20.9 | 18.8 | 2.7 | 13.4 | 9.6 |
| Average | 50.2 | 46.1 | 50.4 | 45.6 | 38.1 | 43.3 | 41.5 |

aquif-3.5-8B-Think & aquif-3.5-A4B-Think (Reasoning Models): advanced reasoning-capable models designed for complex problem-solving tasks. The A4B variant leverages MoE architecture for enhanced efficiency while maintaining superior reasoning performance.

| Metric | aquif-3.5 (12B A4B) | aquif-3.5 (8B) | Qwen3 Thinking 2507 (31B A3B) | gpt-oss-20b (21B A4B) | Nemotron Nano v2 (9B) | Solar Pro 2 |
|--------|---------------------|----------------|-------------------------------|-----------------------|-----------------------|-------------|
| MMLU-Pro | 78.5 | 78.1 | 80.5 | 73.6 | 74.2 | 80.5 |
| GPQA Diamond | 70.8 | 66.8 | 70.7 | 61.7 | 64.0 | 68.7 |
| AIME 2025 | 84.4 | 81.4 | 56.3 | 61.7 | 69.7 | 61.3 |
| LiveCodeBench | 66.1 | 61.5 | 70.7 | 72.1 | 71.1 | 61.6 |
| Humanity's Last Exam | 8.9 | 8.2 | 9.8 | 8.5 | 6.5 | 7.0 |
| TAU-Bench v2 (avg) | 43.7 | 36.8 | 35.7 | 43.2 | 34.9 | 38.7 |
| Average | 58.7 | 55.5 | 54.0 | 53.5 | 53.4 | 53.0 |

Key improvements:
- Simplified Naming: Clear size-based nomenclature for easier model selection
- Enhanced MoE Support: Multiple MoE configurations across different model sizes
- Reasoning Capabilities: Dedicated thinking models for complex problem-solving
- Extended Context: Up to 128k context window for long-form applications
- Multilingual by Default: Native multilingual support across all variants
- Performance Gains: 5-15% improvement across benchmarks compared to aquif-3

Recommended use:
- aquif-3.5-A0.6B: Experimental applications, resource-constrained environments
- aquif-3.5-3B: General-purpose applications, balanced performance/efficiency
- aquif-3.5-7B: Multilingual applications, long-context tasks
- aquif-3.5-8B-Think: Complex reasoning, scientific analysis
- aquif-3.5-A4B-Think: Advanced reasoning with efficiency optimization

All models support:
- BF16 and FP16 precision
- Standard transformer architecture optimizations
- Efficient attention mechanisms
- Multi-head attention with optimized KV caching

Acknowledgements:
- Qwen Team: Base architecture for 7B, 8B, and 12B-A4B models
- Meta Llama Team: Base architecture for 3B and 2.6B-A0.6B models
- Hugging Face: Model hosting infrastructure and training libraries

This project is released under the Apache 2.0 License. See LICENSE file for details.
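As a rough illustration of the performance-per-parameter claim above (this is not a figure from the model card), one can divide each model's MMLU score by its active parameter count taken from the overview table:

```python
# Hedged sketch: MMLU points per billion ACTIVE parameters, computed from the
# aquif-3.5 overview table. A crude efficiency proxy, not an official metric.
models = {
    # name: (active params in B, MMLU %)
    "aquif-3.5-A0.6B":     (0.6,  60.5),
    "aquif-3.5-3B":        (2.67, 70.2),
    "aquif-3.5-7B":        (7.3,  78.5),
    "aquif-3.5-8B-Think":  (8.2,  81.1),
    "aquif-3.5-A4B-Think": (4.0,  86.9),
}

def mmlu_per_active_b(active_b: float, mmlu: float) -> float:
    """MMLU points per billion active parameters."""
    return mmlu / active_b

scores = {name: round(mmlu_per_active_b(a, m), 1)
          for name, (a, m) in models.items()}
```

On these numbers the MoE variants score well above similarly sized dense models (e.g. the A4B-Think at roughly 21.7 points/B active versus about 9.9 for the dense 8B-Think), which is the efficiency argument the series makes.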

license:apache-2.0
193
13

aquif-4-Exp

aquif-4-Exp is an experimental research preview of the upcoming aquif-4 family of models. It represents a significant architectural departure from the aquif-3.5 series, introducing hybrid attention mechanisms and advanced mixture-of-experts configurations. This model is not positioned as a direct successor to aquif-3.5, but rather as a proof-of-concept for next-generation innovations in the aquif model family.

News:
- [10.20.2025] 🔥 SGLang wheel for Aquif4Linear released
- [10.18.2025] 🔥 vLLM wheel for Aquif4Linear released
- [10.17.2025] 🔥 GitHub repo for aquif-4 created here
- [10.15.2025] 🔥 aquif-4-Exp (16B A3B) released

| Attribute | Value |
|-----------|-------|
| Total Parameters | 16.45B |
| Active Parameters | 3.2B |
| Activation Ratio | 1:16 |
| Expert Count | 256 |
| Experts per Token | 16 |
| Attention Type | Hybrid (Softmax + Linear) |
| Context Window | 128K (expandable to 512K via YaRN) |
| Is Reasoning Model? | ✅ |
| Model Type | Mixture-of-Experts (MoE) |

aquif-4-Exp is the first aquif model to implement a hybrid attention architecture combining:
- Softmax Attention: Applied at strategic layers for precise token interactions and complex reasoning patterns
- Linear Attention: Leverages Lightning Attention-2 (https://arxiv.org/abs/2401.04658) for efficient long-context processing

This combination enables efficient processing of extended sequences while maintaining the reasoning capabilities necessary for complex problem-solving tasks.
Expert configuration:
- 256 total experts with a 16-expert activation strategy
- 1:16 activation ratio provides exceptional parameter efficiency
- Only 3.2B parameters are active during inference, enabling deployment on resource-constrained hardware while maintaining performance comparable to much larger dense models
- Expert routing is optimized for both training stability and inference efficiency

Context window:
- 128K native context window for long-document processing
- Expandable to 512K tokens using YaRN (Yet another RoPE extensioN) without full retraining
- Efficient handling of multi-document scenarios and extensive code repositories

The aquif-4-Exp implementation builds upon the Aquif4Linear architecture, featuring:
- Rotary Position Embeddings (RoPE) with optional scaling via YaRN
- Group-normalized RMSNorm for stable layer normalization across attention heads
- Efficient KV caching for accelerated inference
- Optimized flash-linear-attention operators from the FLA library

As an experimental research model, aquif-4-Exp demonstrates:
- Reasoning-focused performance: Optimized for complex problem-solving and multi-step inference
- Efficiency at scale: 3.2B active parameters achieve competitive performance with larger models
- Multilingual support: Native support for English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, and Japanese
- Long-context understanding: Maintains coherence and reasoning quality across extended sequences

Figure 1: aquif-4-Exp and aquif-3.5-Think on Context Length x Normalized Prefill Throughput
Figure 2: aquif-4-Exp and aquif-3.5-Think on Generation Length x Normalized Decode Throughput
Figure 3: aquif-4-Exp and others evaluated on MMLU-Pro, AIME 2025, LiveCodeBench and GPQA Diamond (Chart)
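The expert configuration described above (256 experts, 16 active per token) follows the usual top-k routing pattern. A minimal, illustrative sketch of that pattern — not the actual aquif-4 routing code:

```python
import math
import random

# Illustrative top-k MoE routing sketch using the ratios stated above:
# 256 experts in total, 16 selected per token (a 1:16 activation ratio).
N_EXPERTS, TOP_K = 256, 16

def route(logits: list[float]) -> tuple[list[int], list[float]]:
    """Pick the TOP_K highest-scoring experts and softmax their gate weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-TOP_K:]
    mx = max(logits[i] for i in top)                     # for numerical stability
    exp = [math.exp(logits[i] - mx) for i in top]
    z = sum(exp)
    return top, [e / z for e in exp]                     # weights sum to 1

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]  # toy router scores
idx, gates = route(logits)
```

Each token's output is then a gate-weighted sum over only the 16 selected experts, which is why just 3.2B of the 16.45B parameters participate in any one forward step.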
| Metric | aquif-4-Exp (16B A3.2B) | aquif-3.5-Think (8.2B) | Qwen3-VL-Thinking-2510 (8.8B) | Ring-mini-2.0 (16.3B A1.4B) | gpt-oss (21B A3.6B) |
|--------|-------------------------|------------------------|-------------------------------|-----------------------------|---------------------|
| MMLU-Pro | 76.9 | 78.1 | 77.3 | 66.8 | 71.5 |
| AIME 2025 | 82.3 | 81.4 | 80.3 | 74.1 | 72.1 |
| LiveCodeBench | 65.7 | 61.5 | 58.6 | 62.6 | 54.9 |
| GPQA Diamond | 70.1 | 66.8 | 69.9 | 68.2 | 66.0 |
| Average | 73.8 | 72.0 | 71.5 | 67.9 | 66.1 |

Figure 4: aquif-4-Exp and others evaluated on MMLU-Pro, AIME 2025, LiveCodeBench and GPQA Diamond (Table).

Note: aquif-4-Exp is natively supported through the Hugging Face Transformers library; vLLM and SGLang support is available through dedicated wheels, and llama.cpp support will arrive with the full aquif-4 family release. To use the model with context windows beyond the default 128K tokens, configure YaRN scaling in the model's configuration before loading.

Framework support:
- Transformers (Native): ✅ Full support
- vLLM: ✅ Support through wheel
- SGLang: ✅ Support through wheel
- llama.cpp: ❌ Not supported

Framework support will be expanded with the full aquif-4 family release.

Recommended for:
- Research applications exploring hybrid attention mechanisms and MoE architectures
- Reasoning-heavy tasks requiring interpretable chain-of-thought outputs
- Long-context processing for documents, code analysis, and multi-turn conversations
- Efficiency-critical deployments where parameter count matters as much as performance

Limitations:
- Experimental status: This is a research preview; stability and performance may evolve with updates
- CoT overhead: Chain-of-thought reasoning increases generation latency compared to direct answering
- Hardware requirements: Despite 3.2B active parameters, peak memory usage during inference can be higher due to expert loading
- Not a full successor: aquif-4-Exp does not replace aquif-3.5 for production use cases; it represents architectural exploration
- Limited framework coverage: Best supported in Hugging Face Transformers; the vLLM and SGLang wheels are early releases, and llama.cpp integration is forthcoming

Technical specifications:
- Attention Implementation: Hybrid softmax + linear (Lightning Attention-2)
- Precision Support: BF16, FP16
- Position Encoding: RoPE with YaRN scaling capability
- Training Data: Multilingual corpus spanning 10+ languages
- Model Family: First of the upcoming aquif-4 experimental series

Deployment tips:
- Use flash-attention-2 or SDPA for softmax attention layers when available
- Consider YaRN configuration for context windows beyond 128K
- Monitor VRAM usage with full expert loading enabled
- Leverage KV caching for multi-turn conversations
- Ensure `trust_remote_code=True` is set when loading from the Hugging Face Hub

aquif-4-Exp represents the first experimental release in the aquif-4 family exploration. The full aquif-4 release will not be a single model, but rather a comprehensive family of models with varying architectures, sizes, and specializations, all leveraging the innovations demonstrated in this experimental preview.

Acknowledgements:
- aquif AI Research Team: Architecture design and optimization
- EleutherAI & HuggingFace: GPT-NeoX and modeling foundations
- Flash Linear Attention Project: FLA library for efficient kernel implementations
- Lightning Attention Authors: Attention mechanism research

This project is released under the Apache 2.0 License.

Note: aquif-4-Exp is a research release. For production applications, please refer to the aquif-3.5 model series. Feedback and findings from this experimental release will inform the development of the full aquif-4 family.
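A minimal sketch of what a YaRN extension to 512K might look like, assuming the common Hugging Face `rope_scaling` convention; the exact config keys for aquif-4-Exp are an assumption here, so check the model's own config.json before relying on them:

```python
# Hedged sketch: YaRN context extension from the native 128K window to 512K.
# Key names follow the usual Hugging Face `rope_scaling` convention and are
# NOT confirmed for aquif-4-Exp specifically.
NATIVE_CONTEXT = 131_072   # 128K native window stated in the spec table
TARGET_CONTEXT = 524_288   # 512K extended window

rope_scaling = {
    "rope_type": "yarn",
    "factor": TARGET_CONTEXT / NATIVE_CONTEXT,            # 512K / 128K = 4.0
    "original_max_position_embeddings": NATIVE_CONTEXT,   # assumed key name
}
```

The resulting dict would typically be written into the model config (or passed as a config override) before calling `from_pretrained(..., trust_remote_code=True)`.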

license:apache-2.0
190
3

aquif-Image-14B

license:apache-2.0
181
65

aquif-AlphaMoE-7.5B-A3B

aquif-AlphaMoE is the first foundational model designed entirely by aquif AI, marking a shift from third-party based architectures (used in aquif-3 and aquif-3.5) toward an in-house architecture family. Released on October 1, 2025, AlphaMoE debuts the `AquifAlphaMoEForCausalLM` design, a scalable Mixture of Experts (MoE) framework that balances efficiency, reasoning, and multilingual capability. This release represents aquif AI's first step into independent foundational model architecture design, with a focus on modular expert scaling, long-context performance, and efficient parameter utilization.

| Model | HuggingFace Repository |
|-------|------------------------|
| aquif-AlphaMoE-7.5B-A3B | aquif-ai/aquif-AlphaMoE-7.5B-A3B |

| Model | Total Params (B) | Active Params (B) | Experts (Total / Active) | Context | Attention | Vocab Size | MMLU | GPQA-D | LiveCodeBench | Math-500 | Average |
|-------|------------------|-------------------|--------------------------|---------|-----------|------------|------|--------|---------------|----------|---------|
| aquif-AlphaMoE-7.5B-A3B | 7.47 | 2.92 | 64 / 4 | 164k | GQA (16 heads) | 128k | 86.7 | 60.1 | 35.9 | 87.3 | 67.5 |

| Metric | AlphaMoE (7.5B A3B) | aquif-3-moe (17B A2.8B) | Ling-mini-2.0 (16B A1.4B) | Qwen3-Instruct-2507 (4B) | aquif-3.5 (7.3B) | Granite-4.0-HS (32B A9B) | Gemma-3 (12.2B) |
|---------------|---------------------|--------------------------|---------------------------|--------------------------|------------------|--------------------------|-----------------|
| MMLU | 84.3 | 83.2 | 80.9 | 81.6 | 78.5 | 78.5 | 78.5 |
| GPQA-Diamond | 57.5 | 56.7 | 54.3 | 49.6 | 42.3 | 41.6 | 34.9 |
| LiveCodeBench | 35.9 | 28.6 | 34.8 | 31.9 | 21.3 | 25.1 | 13.7 |
| Math-500 | 87.3 | 91.4 | 89.4 | 84.4 | 90.2 | 85.4 | 82.4 |
| Average | 66.3 | 65.0 | 64.9 | 61.9 | 58.1 | 57.7 | 52.4 |

Highlights:
- First Foundational Architecture: Designed from scratch by aquif AI, unlike aquif-3 and 3.5 which relied on third-party bases.
- Scalable MoE Design: 64 total experts with 4 active per token, enabling dynamic compute allocation.
- High Efficiency: 7.47B total parameters but only 2.92B active, delivering strong performance-to-compute ratios.
- Extended Context: 164k token context window for long-form reasoning and document handling.
- Strong Benchmarks: Surpasses previous aquif generations and peer models in general knowledge, science, and code tasks.
- Multilingual Support: Optimized for 10+ major languages, ensuring broad usability.

Technical details:
- Architecture Name: `AquifAlphaMoEForCausalLM`
- Total Parameters: 7.47B
- Active Parameters: 2.92B
- Total Experts: 64
- Active Experts: 4
- Context Window: 164k tokens
- Attention Mechanism: GQA with 16 heads
- Vocabulary Size: 128k
- Supported Precisions: FP16, BF16

This project is released under the MIT license (previously Apache 2.0). See LICENSE file for details.
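The GQA attention listed in the spec table groups several query heads over each shared KV head, shrinking the KV cache. A minimal sketch of the grouping: the card states 16 attention heads, but the KV-head count is not given, so the 4 used here is a made-up example value:

```python
# Illustrative GQA head-grouping sketch (NOT the AlphaMoE implementation).
N_Q_HEADS = 16   # stated in the spec table
N_KV_HEADS = 4   # hypothetical: the actual KV-head count is not published

def kv_head_for(q_head: int) -> int:
    """Grouped-query attention: consecutive query heads share one KV head."""
    group_size = N_Q_HEADS // N_KV_HEADS
    return q_head // group_size

groups = [kv_head_for(h) for h in range(N_Q_HEADS)]
# Relative to full multi-head attention, the KV cache stores N_KV_HEADS
# instead of N_Q_HEADS heads per layer:
kv_cache_saving = N_Q_HEADS // N_KV_HEADS
```

With these toy numbers, query heads 0-3 attend through KV head 0, heads 4-7 through KV head 1, and so on, cutting per-token KV-cache memory by the grouping factor.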

license:mit
48
4

aquif-3.5-A4B-Think


license:apache-2.0
23
7

aquif-3-micro

license:apache-2.0
23
1

aquif-3-moe-17B-A2.8B

A high-performance mixture-of-experts language model optimized for efficiency, coding, science, and general use. With 17B total parameters and 2.8B active parameters, aquif-3-moe delivers competitive performance across multiple domains while maintaining computational efficiency.

- Architecture: Mixture of Experts (MoE)
- Total Parameters: 17 billion
- Active Parameters: 2.8 billion
- License: Apache 2.0
- Library: transformers

| Metric | aquif-3-moe (17B a2.8B) | Phi-4 (14B) | Qwen3 (14B) | Gemma 3 (27B) | GPT-4.1 nano (Propr.) | Mistral Small 3.2 (24B) |
|--------|-------------------------|-------------|-------------|---------------|-----------------------|-------------------------|
| MMLU (General Knowledge) | 83.2 | 84.8 | 82.0 | 78.6 | 80.1 | 80.5 |
| LiveCodeBench (Coding) | 28.6 | 25.2 | 29.0 | 26.9 | 32.6 | 27.5 |
| MATH-500 (Math) | 91.4 | 80.8 | 89.8 | 88.3 | 84.8 | 88.3 |
| GPQA Diamond (Science) | 56.7 | 56.1 | 54.8 | 42.8 | 50.3 | 50.5 |
| Average | 65.0 | 61.7 | 63.9 | 59.2 | 62.0 | 61.7 |

Key strengths:
- Mathematical Reasoning: Achieves 91.4% on MATH-500, demonstrating exceptional mathematical problem-solving capabilities
- Scientific Understanding: Leads in GPQA Diamond with 56.7%, showing strong scientific reasoning
- Efficiency: Delivers competitive performance with only 2.8B active parameters
- General Knowledge: Strong MMLU performance at 83.2%

Use cases:
- Mathematical problem solving and reasoning
- Scientific research and analysis
- Code generation and programming assistance
- General question answering and text generation
- Educational content creation

The mixture-of-experts architecture enables efficient scaling by activating only a subset of parameters for each input, providing the benefits of a larger model while maintaining computational efficiency comparable to much smaller dense models.

license:apache-2.0
20
14

aquif-moe-400M

license:apache-2.0
18
2

aquif-3.5-3B


llama
18
1

aquif-3-moe-7B-A1B-Preview

license:apache-2.0
17
0

aquif-3-moe-17B-A2.8B-Think

A high-performance mixture-of-experts language model optimized for efficiency, coding, science, and general use. With 17B total parameters and 2.8B active parameters, this thinking variant of aquif-3-moe delivers competitive reasoning performance across multiple domains while maintaining computational efficiency.

- Architecture: Mixture of Experts (MoE)
- Total Parameters: 17 billion
- Active Parameters: 2.8 billion
- License: Apache 2.0
- Library: transformers

| Metric | aquif-3-moe (Thinking 17B a2.8B) | Phi-4 (Thinking 14B) | Qwen3 (Thinking 8B) | DeepSeek R1 (Qwen3 8B) | Magistral Small (24B) | Gemini 2.5 Flash-Lite (Propr.) |
|------------------------|----------------------------------|----------------------|---------------------|------------------------|-----------------------|--------------------------------|
| LiveCodeBench (Coding) | 63.2 | 53.8 | 58.1 | 60.5 | 51.4 | 59.3 |
| AIME 2024 (Math) | 80.2 | 75.3 | 74.7 | 65.0 | 71.3 | 70.3 |
| GPQA Diamond (Science) | 64.2 | 65.8 | 62.0 | 61.1 | 64.1 | 62.5 |
| Average | 69.2 | 65.0 | 64.9 | 62.2 | 62.3 | 64.0 |

The mixture-of-experts architecture enables efficient scaling by activating only a subset of parameters for each input, providing the benefits of a larger model while maintaining computational efficiency comparable to much smaller dense models.

license:apache-2.0
13
3

aquif-3.6-8B

aquif-3.6-8B is a hybrid reasoning model that automatically determines when and how deeply to think based on query complexity. Built on aquif-3.5-8B-Think with AutoThink RL data, it achieves 28% better token efficiency and 4% performance improvement across benchmarks. - Key Features - Dynamic reasoning, efficiency gains, and smart resource allocation - Performance - Benchmark results showing 4% average improvement - Token Efficiency - 28% reduction in token usage - Thinking Ratio - 12% reduction in thinking frequency - Benchmark Highlights - Detailed results for AIME, LiveCodeBench, and GPQA Diamond - Model Details - Architecture and specifications - Usage - Code examples for implementation - Previous Versions - Links to earlier models aquif-3.6-8B is a hybrid reasoning model that dynamically decides if and how much to think based on query complexity. Inspired by KAT-V1's approach of automatic thinking using AutoThink RL data on top of aquif-3.5-8B-Think, the model uses the following format: This is the same format as KAT-V1-40B. Unlike something like DeepSeek-V3.1's toggleable reasoning that requires manual control (thinkingon/off), aquif-3.6's judge autonomously allocates reasoning depth - intelligently adapting its cognitive effort to each task automatically. 
Key features:

- 🧠 Dynamic Reasoning: Automatically determines when and how deeply to think
- ⚡ 28% More Efficient: Significant token reduction while improving performance
- 📈 Better Performance: 4% average improvement across benchmarks
- 🎯 Smart Resource Allocation: 12% reduction in thinking ratio on average

Performance:

| Benchmark | aquif-3.6-8B | aquif-3.5-8B | Improvement |
|-----------|--------------|--------------|-------------|
| AIME 2025 | 82.5 | 81.4 | +1% |
| LiveCodeBench | 64.2 | 61.5 | +4% |
| GPQA Diamond | 71.0 | 66.8 | +6% |
| Average | 72.6 | 69.9 | +4% |

Token usage:

| Benchmark | aquif-3.6-8B | aquif-3.5-8B | Reduction |
|-----------|--------------|--------------|-----------|
| AIME 2025 | 15,670 | 21,265 | -26% |
| LiveCodeBench | 13,240 | 19,460 | -32% |
| GPQA Diamond | 8,760 | 11,560 | -24% |
| Average | 12,557 | 17,428 | -28% |

Thinking ratio:

| Benchmark | aquif-3.6-8B | aquif-3.5-8B | Reduction |
|-----------|--------------|--------------|-----------|
| AIME 2025 | 93.0% | 100.0% | -7% |
| LiveCodeBench | 82.0% | 100.0% | -18% |
| GPQA Diamond | 89.0% | 100.0% | -11% |
| Average | 88.0% | 100.0% | -12% |

Benchmark highlights:

- AIME 2025: 26% fewer tokens, +1% performance, -7% thinking ratio
- LiveCodeBench: 32% fewer tokens, +4% performance, -18% thinking ratio
- GPQA Diamond: 24% fewer tokens, +6% performance, -11% thinking ratio

Model details:

- Base Model: 8B parameters
- Architecture: Hybrid reasoning with dynamic thinking allocation
- Context Length: 40K tokens
- License: Apache 2.0
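The percentage reductions in the token-usage table follow directly from the raw token counts; a quick sanity check of that arithmetic, with values copied from the table above:

```python
# Completion tokens per benchmark: (aquif-3.6-8B, aquif-3.5-8B)
tokens = {
    "AIME 2025": (15_670, 21_265),
    "LiveCodeBench": (13_240, 19_460),
    "GPQA Diamond": (8_760, 11_560),
    "Average": (12_557, 17_428),
}

def reduction(new, old):
    """Percentage reduction of `new` relative to `old`, rounded to a whole percent."""
    return round(100 * (old - new) / old)

reductions = {name: reduction(new, old) for name, (new, old) in tokens.items()}
# e.g. reductions["Average"] -> 28, matching the -28% figure above
```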

license:apache-2.0

aquif-moe-800M

license:apache-2.0

aquif-3.5-7B

The aquif-3.5 series is the successor to aquif-3, featuring a simplified naming scheme, expanded Mixture of Experts (MoE) options, and across-the-board performance improvements. This release streamlines model selection while delivering enhanced capabilities across reasoning, multilingual support, and general intelligence tasks.

| Model | HuggingFace Repository |
|-------|------------------------|
| aquif-3.5-A0.6B-Preview | aquiffoo/aquif-3.5-A0.6B-Preview |
| aquif-3.5-3B | aquiffoo/aquif-3.5-3B |
| aquif-3.5-7B | aquiffoo/aquif-3.5-7B |
| aquif-3.5-8B-Think | aquiffoo/aquif-3.5-8B-Think |
| aquif-3.5-A4B-Think | aquiffoo/aquif-3.5-A4B-Think |

| Model | Size (B) | Active Params (B) | Reasoning | MoE | Multilingual | MMLU | Context Window |
|-------|----------|-------------------|-----------|-----|--------------|------|----------------|
| aquif-3.5-A0.6B | 2.61 | 0.6 | ❌ | ✅ | ✅ | 60.5% | 4k |
| aquif-3.5-3B | 2.67 | 2.67 | ❌ | ❌ | ✅ | 70.2% | 32k |
| aquif-3.5-7B | 7.3 | 7.3 | ❌ | ❌ | ✅ | 78.5% | 16k |
| aquif-3.5-8B-Think | 8.2 | 8.2 | ✅ | ❌ | ✅ | 81.1% | 40k |
| aquif-3.5-A4B-Think | 12 | 4 | ✅ | ✅ | ✅ | 86.9% | 128k |

aquif-3.5-A0.6B-Preview: an experimental small-scale Mixture of Experts model designed for multilingual applications with minimal computational overhead. Despite its compact active parameter count, it demonstrates competitive performance against larger dense models.

| Metric | aquif-3.5 (2.6B A0.6B) | Qwen3 (0.8B) | LFM2 (0.7B) | aquif-3 (0.4B) |
|--------|------------------------|--------------|-------------|----------------|
| MMLU | 60.5 | 44.9 | 49.9 | 55.6 |
| GPQA | 30.2 | 22.1 | 28.5 | 28.5 |
| GSM8K | 50.7 | 36.5 | 46.4 | 52.1 |
| HumanEval | 45.2 | 36.0 | 40.0 | 37.4 |
| Average | 46.7 | 34.9 | 41.2 | 43.4 |

aquif-3.5-3B: the new standard for small dense models, offering optimal performance-per-parameter efficiency for general-purpose applications.
| Metric | aquif-3.5 (2.7B) | EXAONE 3.5 (2.4B) | Qwen3 (4B) | Gemma 3 (4B) | Phi-4-mini (3.8B) | Apriel-5B-Instruct (4.8B) | aquif-3 (3.2B) |
|--------|------------------|-------------------|------------|--------------|-------------------|---------------------------|----------------|
| MMLU (General Knowledge) | 70.2 | 60.4 | 70.4 | 59.6 | 67.3 | 64.6 | 67.5 |
| GPQA Diamond (Science) | 35.8 | 28.4 | 39.3 | 30.9 | 25.2 | 28.4 | 36.1 |
| LiveCodeBench (Coding) | 23.1 | 12.5 | 21.3 | 11.2 | 10.4 | 11.6 | 15.4 |
| IFEval (Instruction Following) | 78.9 | 73.6 | 71.2 | 80.2 | 68.6 | 80.8 | 78.9 |
| AIME 2025 (Competition Math) | 13.4 | 4.5 | 9.8 | 12.7 | 5.3 | 4.3 | 9.6 |
| Average | 44.3 | 35.9 | 42.4 | 38.9 | 35.4 | 37.9 | 41.5 |

aquif-3.5-7B: a Qwen-based architecture optimized for multilingual applications with extended context capabilities, delivering state-of-the-art performance in its size class.

| Metric | aquif-3.5 (7.3B) | EXAONE 3.5 (7.8B) | Qwen3 (8.2B) | Gemma 3 (12B) | Llama 3.1 (8B) | Kanana 1.5 (8B) | aquif-3 (3.2B) |
|--------|------------------|-------------------|--------------|---------------|----------------|-----------------|----------------|
| MMLU (General Knowledge) | 78.5 | 72.2 | 82.9 | 74.5 | 69.2 | 68.8 | 67.5 |
| GPQA Diamond (Science) | 42.3 | 39.4 | 39.3 | 40.9 | 32.8 | 37.5 | 36.1 |
| LiveCodeBench (Coding) | 21.3 | 18.0 | 23.9 | 13.7 | 10.8 | 16.5 | 15.4 |
| IFEval (Instruction Following) | 85.6 | 82.6 | 85.4 | 80.2 | 75.0 | 80.1 | 78.9 |
| AIME 2025 (Competition Math) | 23.4 | 18.3 | 20.9 | 18.8 | 2.7 | 13.4 | 9.6 |
| Average | 50.2 | 46.1 | 50.4 | 45.6 | 38.1 | 43.3 | 41.5 |

aquif-3.5-8B-Think & aquif-3.5-A4B-Think (Reasoning Models): advanced reasoning-capable models designed for complex problem-solving tasks. The A4B variant leverages MoE architecture for enhanced efficiency while maintaining superior reasoning performance.
| Metric | aquif-3.5 (12B A4B) | aquif-3.5 (8B) | Qwen3 Thinking 2507 (31B A3B) | gpt-oss-20b (21B A4B) | Nemotron Nano v2 (9B) | Solar Pro 2 |
|--------|---------------------|----------------|-------------------------------|-----------------------|-----------------------|-------------|
| MMLU-Pro | 78.5 | 78.1 | 80.5 | 73.6 | 74.2 | 80.5 |
| GPQA Diamond | 70.8 | 66.8 | 70.7 | 61.7 | 64.0 | 68.7 |
| AIME 2025 | 84.4 | 81.4 | 56.3 | 61.7 | 69.7 | 61.3 |
| LiveCodeBench | 66.1 | 61.5 | 70.7 | 72.1 | 71.1 | 61.6 |
| Humanity's Last Exam | 8.9 | 8.2 | 9.8 | 8.5 | 6.5 | 7.0 |
| TAU-Bench v2 (avg) | 43.7 | 36.8 | 35.7 | 43.2 | 34.9 | 38.7 |
| Average | 58.7 | 55.5 | 54.0 | 53.5 | 53.4 | 53.0 |

What's new in aquif-3.5:

- Simplified Naming: Clear size-based nomenclature for easier model selection
- Enhanced MoE Support: Multiple MoE configurations across different model sizes
- Reasoning Capabilities: Dedicated thinking models for complex problem-solving
- Extended Context: Up to 128k context window for long-form applications
- Multilingual by Default: Native multilingual support across all variants
- Performance Gains: 5-15% improvement across benchmarks compared to aquif-3

Recommended use:

- aquif-3.5-A0.6B: Experimental applications, resource-constrained environments
- aquif-3.5-3B: General-purpose applications, balanced performance/efficiency
- aquif-3.5-7B: Multilingual applications, long-context tasks
- aquif-3.5-8B-Think: Complex reasoning, scientific analysis
- aquif-3.5-A4B-Think: Advanced reasoning with efficiency optimization

All models support:

- BF16 and FP16 precision
- Standard transformer architecture optimizations
- Efficient attention mechanisms
- Multi-head attention with optimized KV caching

Acknowledgements:

- Qwen Team: Base architecture for 7B, 8B, and 12B-A4B models
- Meta Llama Team: Base architecture for 3B and 2.6B-A0.6B models
- Hugging Face: Model hosting infrastructure and training libraries

This project is released under the Apache 2.0 License. See LICENSE file for details.
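The Average row in the reasoning comparison above is the plain arithmetic mean of the six benchmark scores; for example, reproducing it for the two aquif variants:

```python
a4b = [78.5, 70.8, 84.4, 66.1, 8.9, 43.7]  # aquif-3.5 (12B A4B) column
b8 = [78.1, 66.8, 81.4, 61.5, 8.2, 36.8]   # aquif-3.5 (8B) column

def mean1(xs):
    """Arithmetic mean rounded to one decimal, as reported in the table."""
    return round(sum(xs) / len(xs), 1)

print(mean1(a4b), mean1(b8))  # 58.7 55.5
```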

license:apache-2.0

aquif-3-mini

A high-performance 3.2B parameter language model based on Meta's Llama 3.2 architecture, optimized for efficiency while maintaining strong capabilities across multiple domains including general knowledge, science, mathematics, coding, and multilingual tasks.

- Base Model: meta-llama/Llama-3.2-3B
- Architecture: Llama
- Parameter Count: 3.2 billion parameters
- Languages: English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, Japanese

| Metric | aquif-3-mini (3.2B) | Llama 3.2 (3.2B) | Qwen3 (4B) | Gemma 3n E4B (8.4B) | SmolLM3 (3.1B) | Phi-4 mini (3.8B) | Granite 3.3 (2.5B) |
|--------|---------------------|------------------|------------|---------------------|----------------|-------------------|--------------------|
| MMLU (General Knowledge) | 67.5 | 63.4 | 67.0 | 64.9 | 59.5 | 67.3 | 55.9 |
| GPQA Diamond (Science) | 36.1 | 29.4 | 40.7 | 29.6 | 35.7 | 36.9 | 25.3 |
| AIME 2025 (Competition Math) | 9.6 | 0.3 | 17.1 | 11.6 | 9.3 | 10.0 | 2.5 |
| LiveCodeBench (Coding) | 15.4 | 8.3 | 23.3 | 14.6 | 15.2 | 12.6 | 9.4 |
| Global MMLU (Multilingual) | 58.0 | 46.8 | 65.1 | 53.1 | 53.5 | 49.3 | 49.7 |
| IFEval (Instruction Following) | 78.9 | 71.6 | 68.9 | 56.8 | 76.7 | 70.1 | 65.8 |
| BFCL Simple (Tool Calling) | 92.3 | 78.6 | 81.3 | 71.8 | 88.8 | 70.3 | 72.2 |

- Exceptional Tool Calling: Achieves 92.3% on the BFCL Simple benchmark, outperforming all comparison models
- Strong Instruction Following: 78.9% on IFEval, demonstrating reliable adherence to complex instructions
- Comprehensive Knowledge: 67.5% on MMLU, matching or exceeding larger models
- Advanced Reasoning: 36.1% on GPQA Diamond, showing strong scientific reasoning capabilities
- Multilingual Competency: Supports 10 languages with competitive performance

We gratefully acknowledge:

- Meta AI for the foundational Llama 3.2 architecture and pre-trained weights
- Hugging Face for the transformers library and model hosting platform that enables easy access and deployment

For questions, issues, or collaboration opportunities, please reach out through the Hugging Face model page.
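Tool-calling benchmarks like BFCL Simple check whether the model emits a well-formed function call that matches a ground-truth schema. A minimal sketch of that kind of check follows; the `get_weather` function and its fields are hypothetical, and real BFCL scoring is more lenient (AST-based) than this exact match:

```python
import json

def parse_tool_call(model_output):
    """Parse a JSON function call of the shape {"name": ..., "arguments": {...}}."""
    call = json.loads(model_output)
    assert isinstance(call["name"], str) and isinstance(call["arguments"], dict)
    return call

def matches(call, expected):
    """Simplified exact match on function name and arguments."""
    return call["name"] == expected["name"] and call["arguments"] == expected["arguments"]

# A hypothetical model response and the ground-truth call it would be scored against.
response = '{"name": "get_weather", "arguments": {"city": "Lisbon", "unit": "celsius"}}'
expected = {"name": "get_weather", "arguments": {"city": "Lisbon", "unit": "celsius"}}
ok = matches(parse_tool_call(response), expected)
```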

llama

aquif-3.5-A0.6B-Preview


license:apache-2.0

aquif-3.5-Max-1205

license:apache-2.0

aquif-3.5-Nano-1B

license:apache-2.0

aquif-Grounding-7B

license:mit

Aquif Dream 6B Exp

license:mit

aquif-3.6-1B

license:apache-2.0

aquif-Spatial-7B

license:apache-2.0