NexaAI
DeepSeek-OCR-GGUF
Qwen3-VL-8B-Thinking-GGUF
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

Run Qwen3-VL-8B-Thinking optimized for CPU/GPU with NexaSDK.

1. Install NexaSDK and create a free account at NexaSDK.
2. Run the model locally with one line of code (a sketch is given at the end of this card).

## Model Description

Qwen3-VL-8B-Thinking is an 8-billion-parameter multimodal large language model from Alibaba Cloud's Qwen team. As part of the Qwen3-VL (Vision-Language) family, it is designed for deep multimodal reasoning, combining visual understanding, long-context comprehension, and structured chain-of-thought generation across text, images, and videos.

The Thinking variant focuses on advanced reasoning transparency and analytical precision. Compared to the Instruct version, it produces richer intermediate reasoning steps, enabling detailed explanation, planning, and multi-hop analysis across visual and textual inputs.

## Features

- Deep Visual Reasoning: Interprets complex scenes, charts, and documents with multi-step logic.
- Chain-of-Thought Generation: Produces structured reasoning traces for improved interpretability and insight.
- Extended Context Handling: Maintains coherence across longer multimodal sequences.
- Multilingual Competence: Understands and generates in multiple languages for global applicability.
- High Accuracy at 8B Scale: Achieves strong benchmark performance in multimodal reasoning and analysis tasks.

## Use Cases

- Research and analysis requiring visual reasoning transparency
- Complex multimodal QA and scientific problem solving
- Visual analytics and explanation generation
- Advanced agent systems needing structured thought or planning steps
- Educational tools requiring detailed, interpretable reasoning

## Inputs and Outputs

Input:
- Text, image(s), or multimodal combinations (including sequential frames or documents)
- Optional context for multi-turn or multimodal reasoning

Output:
- Structured reasoning outputs with intermediate steps
- Detailed answers, explanations, or JSON-formatted reasoning traces

## License

Refer to the official Qwen license for usage and redistribution details.
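A minimal quickstart sketch for step 2 above, assuming the NexaSDK CLI's `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax on your platform):

```bash
# pull the GGUF if needed and start an interactive multimodal chat;
# image paths can be typed or dragged into the prompt
nexa infer NexaAI/Qwen3-VL-8B-Thinking-GGUF
```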
Qwen3-VL-4B-Thinking-GGUF
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

Run Qwen3-VL-4B-Thinking optimized for CPU/GPU with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code (a sketch is given at the end of this card).

## Model Description

Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal large language model from the Qwen team at Alibaba Cloud. Part of the Qwen3-VL (Vision-Language) family, it is designed for advanced visual reasoning and chain-of-thought generation across image, text, and video inputs. Compared to the Instruct variant, the Thinking model emphasizes deeper multi-step reasoning, analysis, and planning. It produces detailed, structured outputs that reflect intermediate reasoning steps, making it well-suited for research, multimodal understanding, and agentic workflows.

## Features

- Vision-Language Understanding: Processes images, text, and videos for joint reasoning tasks.
- Structured Thinking Mode: Generates intermediate reasoning traces for better transparency and interpretability.
- High Accuracy on Visual QA: Performs strongly on visual question answering, chart reasoning, and document analysis benchmarks.
- Multilingual Support: Understands and responds in multiple languages.
- Optimized for Efficiency: Delivers strong performance at 4B scale for on-device or edge deployment.

## Use Cases

- Multimodal reasoning and visual question answering
- Scientific and analytical reasoning tasks involving charts, tables, and documents
- Step-by-step visual explanation or tutoring
- Research on interpretability and chain-of-thought modeling
- Integration into agent systems that require structured reasoning

## Inputs and Outputs

Input:
- Text, images, or combined multimodal prompts (e.g., image + question)

Output:
- Generated text, reasoning traces, or structured responses
- May include explicit thought steps or structured JSON reasoning sequences

## License

Check the official Qwen license for terms of use and redistribution.
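A sketch of the one-line run command, assuming the `nexa infer` subcommand and this repository id (syntax may differ across NexaSDK versions):

```bash
# interactive multimodal chat with the 4B Thinking GGUF build
nexa infer NexaAI/Qwen3-VL-4B-Thinking-GGUF
```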
Qwen2-Audio-7B-GGUF
We're bringing Qwen2-Audio to run locally on edge devices with Nexa-SDK, offering various GGUF quantization options. Qwen2-Audio is a SOTA small-scale multimodal model (AudioLM) that handles audio and text inputs, allowing you to have voice interactions without separate ASR modules. Qwen2-Audio supports English, Chinese, and major European languages, and provides voice chat and audio analysis capabilities for local use cases like:

- Speaker identification and response
- Speech translation and transcription
- Mixed audio and noise detection
- Music and sound analysis

In the following, we demonstrate how to run Qwen2-Audio locally on your device.

Step 1: Install Nexa-SDK (local on-device inference framework)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. It is installable via a Python package or an executable installer.

Step 2: Run the command in your terminal (a sketch is given at the end of this card). In the terminal:

1. Drag and drop your audio file into the terminal (or enter the file path on Linux).
2. Add a text prompt to guide the analysis, or leave it empty for direct voice input.

## Choose Quantizations for Your Device

Run different quantization versions and check RAM requirements in our list.

## Voice Chat

- Answer daily questions
- Offer suggestions
- Speaker identification and response
- Speech translation
- Detecting background noise and responding accordingly

## Audio Analysis

- Information extraction
- Audio summary
- Speech transcription and expansion
- Mixed audio and noise detection
- Music and sound analysis

Results demonstrate that Qwen2-Audio significantly outperforms both previous SOTAs and Qwen-Audio across all tasks.
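A sketch of the Step 2 command, assuming the Nexa-SDK CLI's `nexa run` syntax and a `qwen2audio` model alias (both are assumptions; see the Nexa-SDK docs for the exact model name and quantization tags):

```bash
# start a local voice-chat / audio-analysis session;
# drag an audio file into the prompt, then optionally add a text instruction
nexa run qwen2audio
```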
gemma-3n
Qwen3-VL-4B-Instruct-GGUF
Run Qwen3-VL-4B-Instruct optimized for CPU/GPU with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code:

## Model Description

Qwen3-VL-4B-Instruct is a 4-billion-parameter instr...
OmniVLM-968M
🔥 Latest Update

- [Dec 16, 2024] Our work "OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference" is now live on arXiv! 🚀
- [Nov 27, 2024] Model improvements: the OmniVLM v3 model's GGUF file has been updated in this Hugging Face repo! ✨ 👉 Test these exciting changes in our Hugging Face Space
- [Nov 22, 2024] Model improvements: the OmniVLM v2 model's GGUF file has been updated in this Hugging Face repo! ✨

Key improvements include:

- Enhanced art descriptions
- Better complex image understanding
- Improved anime recognition
- More accurate color and detail detection
- Expanded world knowledge

We are continuously improving OmniVLM-968M based on your valuable feedback! More exciting updates are coming soon - stay tuned! ⭐

OmniVLM is a compact, sub-billion-parameter (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Improving on LLaVA's architecture, it features:

- 9x Token Reduction: Reduces image tokens from 729 to 81, aggressively cutting latency and computational cost. Note that the computation of the vision encoder and the projection part stays the same, but the computation of the language-model backbone is reduced thanks to the 9x shorter image-token span.
- Trustworthy Results: Reduces hallucinations using DPO training on trustworthy data.

Quick Links:
1. Interactive demo in our Hugging Face Space (updated Nov 21, 2024)
2. Quickstart for local setup
3. Learn more in our blogs

Feedback: Send questions or comments about the model in our Discord.

## Intended Use Cases

OmniVLM is intended for Visual Question Answering (answering questions about images) and Image Captioning (describing scenes in photos), making it ideal for on-device applications.

Example demo: generating captions for a 1046×1568 image on an M4 Pro MacBook.

Below we demonstrate a figure to show how OmniVLM performs against nanoLLAVA. In all tasks, OmniVLM outperforms the previous world's smallest vision-language model. We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate the performance of OmniVLM.

| Benchmark | Nexa AI OmniVLM v2 | Nexa AI OmniVLM v1 | nanoLLAVA |
|-------------------|------------------------|------------------------|-----------|
| ScienceQA (Eval) | 71.0 | 62.2 | 59.0 |
| ScienceQA (Test) | 71.0 | 64.5 | 59.0 |
| POPE | 93.3 | 89.4 | 84.1 |
| MM-VET | 30.9 | 27.5 | 23.9 |
| ChartQA (Test) | 61.9 | 59.2 | NA |
| MMMU (Test) | 42.1 | 41.8 | 28.6 |
| MMMU (Eval) | 40.0 | 39.9 | 30.4 |

## How to Use On Device

In the following, we demonstrate how to run OmniVLM locally on your device.

Step 1: Install Nexa-SDK (local on-device inference framework)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. It is installable via a Python package or an executable installer.

Step 2: Run the command in your terminal (a sketch is given at the end of this card).

## Model Architecture

OmniVLM's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs.
- Vision Encoder: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings.
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, we designed a projector that reduces image tokens by 9x.
The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

We developed OmniVLM through a three-stage training pipeline:

Pretraining: The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs; during this stage only the projection-layer parameters are unfrozen to learn these fundamental relationships.

Supervised Fine-tuning (SFT): We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images, so the model generates more contextually appropriate responses.

Direct Preference Optimization (DPO): The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections while maintaining high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targets essential improvements in model output without altering the model's core response characteristics.

## What's next for OmniVLM?

OmniVLM is in early development and we are working to address current limitations:

- Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve document and text understanding

In the long term, we aim to develop OmniVLM as a fully optimized, production-ready solution for edge AI multimodal applications.
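A sketch of the Step 2 terminal command from the "How to Use On Device" section above, assuming the Nexa-SDK CLI's `nexa run` syntax and an `omniVLM` model alias (both are assumptions; check the Nexa-SDK docs for the exact name):

```bash
# start local VQA / captioning; provide an image path and a question when prompted
nexa run omniVLM
```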
Qwen3-0.6B-GGUF
Run them directly with nexa-sdk installed. In the nexa-sdk CLI (a sketch is given at the end of this card):

## Available Quantizations

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Qwen3-0.6B-Q8_0.gguf | Q8_0 | 805 MB | false | High-quality 8-bit quantization. Recommended for efficient inference. |
| Qwen3-0.6B-f16.gguf | f16 | 1.51 GB | false | Half-precision (FP16) format. Better accuracy, requires more memory. |

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:

- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and documentation.
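A sketch of the CLI invocation, assuming the `nexa infer` subcommand and this repository id (exact syntax may differ across NexaSDK versions):

```bash
# run the default quantization of this repo in an interactive chat
nexa infer NexaAI/Qwen3-0.6B-GGUF
```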
Qwen3-VL-2B-Thinking-GGUF
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

Quickstart:

- Download NexaSDK with one click
- One line of code to run in your terminal (a sketch is given at the end of this card)

## Model Description

Qwen3-VL-2B-Thinking is a 2-billion-parameter multimodal model from the Qwen3-VL family, optimized for explicit reasoning and step-by-step visual understanding. It builds upon Qwen3-VL-2B with additional "thinking" supervision, allowing the model to explain its reasoning process across both text and images, which is ideal for research, education, and agentic applications requiring transparent decision traces.

## Features

- Visual reasoning: Performs detailed, interpretable reasoning across images, diagrams, and UI elements.
- Step-by-step thought traces: Generates intermediate reasoning steps for transparency and debugging.
- Multimodal understanding: Supports text, image, and video inputs with consistent logical grounding.
- Compact yet capable: 2B parameters, optimized for low-latency inference and on-device deployment.
- Instruction-tuned: Enhanced alignment for "think-aloud" question answering and visual problem solving.

## Use Cases

- Visual question answering with reasoning chains
- Step-by-step image or chart analysis for education and tutoring
- Debuggable AI agents and reasoning assistants
- Research on interpretable multimodal reasoning
- On-device transparent AI inference for visual domains

## Inputs and Outputs

Inputs:
- Text prompts or questions
- Images, diagrams, or UI screenshots
- Optional multi-turn reasoning chains

Outputs:
- Natural language answers with explicit thought steps
- Detailed reasoning traces combining visual and textual logic

## License

This model is released under the Apache 2.0 License. Refer to the official Hugging Face page for license details and usage terms.

## References

- Qwen3-VL-2B-Thinking on Hugging Face
- Qwen3 Technical Report (arXiv)
- Qwen GitHub Repository
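A sketch of the one-line terminal command, assuming the `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax):

```bash
# interactive chat with explicit thinking traces; drag an image path into the prompt
nexa infer NexaAI/Qwen3-VL-2B-Thinking-GGUF
```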
Qwen3-4B-GGUF
Run them directly with nexa-sdk installed. In the nexa-sdk CLI (a sketch is given at the end of this card):

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-4B has the following features:

- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and documentation.
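A sketch of the CLI step, assuming the `nexa infer` subcommand and this repository id (exact syntax may differ across NexaSDK versions):

```bash
# interactive chat with Qwen3-4B GGUF
nexa infer NexaAI/Qwen3-4B-GGUF
```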
OmniAudio-2.6B
gpt-oss-20b-GGUF
octo-net-gguf
Qwen3-VL-8B-Instruct-GGUF
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

Run Qwen3-VL-8B-Instruct optimized for CPU/GPU with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code (see the sketch below):
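A sketch of step 2, assuming the `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax):

```bash
# interactive multimodal chat with the 8B Instruct GGUF build
nexa infer NexaAI/Qwen3-VL-8B-Instruct-GGUF
```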
Octopus-v2-gguf-awq
Qwen3-VL-2B-Instruct-GGUF
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

Quickstart:

- Download NexaSDK with one click
- One line of code to run in your terminal (a sketch is given at the end of this card)

## Model Description

Qwen3-VL-2B-Instruct is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It's designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, making it ideal for edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.

## Features

- Multimodal I/O: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- Long-context reasoning: Up to 256K context for books, documents, or extended visual analysis.
- Spatial & temporal understanding: Improved grounding and temporal event tracking for videos.
- Agentic capabilities: Recognizes UI elements and reasons about screen layouts for tool use.
- Lightweight footprint: 2B parameters for efficient inference across CPU, GPU, or NPU.

## Use Cases

- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks

## Inputs and Outputs

Inputs:
- Text prompts
- Images (single or multiple)
- Videos or frame sequences
- Mixed multimodal chat turns

Outputs:
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information

## License

This model is released under the Apache 2.0 License. Please refer to the Hugging Face model card for detailed licensing and usage information.
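A sketch of the one-line terminal command, assuming the `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax):

```bash
# interactive chat; drag an image or screenshot path into the prompt for VQA/OCR
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```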
qwen3vl-30B-A3B-mlx
gemma-2-2b-it-GGUF
qwen2.5vl
Octopus-v2
Octopus V2: On-device language model for super agent

## Octopus V4 Release

We are excited to announce that Octopus v4 is now available! Octopus-V4-3B, an advanced open-source language model with 3 billion parameters, serves as the master node in Nexa AI's envisioned graph of language models. Tailored specifically for the MMLU benchmark topics, this model efficiently translates user queries into formats that specialized models can effectively process. It excels at directing these queries to the appropriate specialized model, ensuring precise and effective query handling.

Check our papers and repos:
- paper
- Octopus V4 model page
- Octopus V4 quantized model page
- Octopus V4 GitHub

Key features of Octopus v4:
- 📱 Compact Size: Octopus-V4-3B is compact, enabling it to operate on smart devices efficiently and swiftly.
- 🐙 Accuracy: Octopus-V4-3B accurately maps user queries to the specialized model using a functional token design, enhancing its precision.
- 💪 Reformat Query: Octopus-V4-3B assists in converting natural human language into a more professional format, improving query description and resulting in more accurate responses.

## Octopus V3 Release

We are excited to announce that Octopus v3 is now available! Check our technical report and the Octopus V3 tweet!

Key features of Octopus v3:
- Efficiency: Sub-billion parameters, making it less than half the size of its predecessor, Octopus v2.
- Multi-Modal Capabilities: Processes both text and image inputs.
- Speed and Accuracy: Incorporates our patented functional token technology, achieving function-calling accuracy on par with GPT-4V and GPT-4.
- Multilingual Support: Simultaneous support for English and Mandarin.

Check the Octopus V3 demo videos for Android and iOS.

## Octopus V2 Release

After open-sourcing our model, we received many requests to compare it with Apple's OpenELM and Microsoft's Phi-3; please see the Evaluation section. On our benchmark dataset, Microsoft's Phi-3 achieves 45.7% accuracy with an average inference latency of 10.2 s, while Apple's OpenELM fails to generate a function call (please see this screenshot). Our model, Octopus V2, achieves 99.5% accuracy with an average inference latency of 0.38 s.

We are a very small team with a lot of work. Please give us more time to prepare the code, and we will open-source it. We hope the Octopus v2 model will be helpful for you. Let's democratize AI agents for everyone. We've received many requests from the car industry, healthcare, the financial system, etc. The Octopus model can be applied to any function, and you can start to think about it now.

Octopus-V2-2B, an advanced open-source language model with 2 billion parameters, represents Nexa AI's research breakthrough in the application of large language models (LLMs) for function calling, specifically tailored for Android APIs. Unlike Retrieval-Augmented Generation (RAG) methods, which require detailed descriptions of potential function arguments (sometimes needing up to tens of thousands of input tokens), Octopus-V2-2B introduces a unique functional token strategy for both its training and inference stages. This approach not only allows it to achieve performance levels comparable to GPT-4 but also significantly enhances its inference speed beyond that of RAG-based methods, making it especially beneficial for edge computing devices.
📱 On-device Applications: Octopus-V2-2B is engineered to operate seamlessly on Android devices, extending its utility across a wide range of applications, from Android system management to the orchestration of multiple devices.

🚀 Inference Speed: When benchmarked, Octopus-V2-2B demonstrates a remarkable inference speed, outperforming the combination of "Llama7B + RAG solution" by a factor of 36x on a single A100 GPU. Furthermore, compared to GPT-4-turbo (gpt-4-0125-preview), which relies on clusters of A100/H100 GPUs, Octopus-V2-2B is 168% faster. This efficiency is attributed to our functional token design.

🐙 Accuracy: Octopus-V2-2B not only excels in speed but also in accuracy, surpassing the "Llama7B + RAG solution" in function-call accuracy by 31%. It achieves function-call accuracy comparable to GPT-4 and RAG + GPT-3.5, with scores ranging between 98% and 100% across benchmark datasets.

💪 Function Calling Capabilities: Octopus-V2-2B is capable of generating individual, nested, and parallel function calls across a variety of complex scenarios.

You can run the model on a GPU using the following code.

## Evaluation

The benchmark results can be viewed in this Excel sheet, which has been manually verified. Microsoft's Phi-3 model achieved an accuracy of 45.7%, with an average inference latency of 10.2 seconds. Meanwhile, Apple's OpenELM was unable to generate a function call, as shown in this screenshot. Additionally, OpenELM's score on the MMLU benchmark is quite low at 26.7, compared to Google's Gemma 2B, which scored 42.3.

Note: One may notice that the query includes all parameters needed by a function. The query is expected to include all parameters during inference as well.

## Training Data

We wrote 20 Android API descriptions to use for training the models; see this file for details. The Android API implementations for our demos, and our training data, will be published later. Below is one example Android API description.

## License

This model was trained on commercially viable data. For use of our model, refer to the license information.

## References

We thank the Google Gemma team for their amazing models!

## Contact

Please contact us with any issues or comments!
Qwen2.5-Omni-3B-GGUF
octo-net
sdxl-turbo
SDXL-Turbo is a fast generative text-to-image model that can synthesize photorealistic images from a text prompt in a single network evaluation. A real-time demo is available here: http://clipdrop.co/stable-diffusion-turbo

Please note: For commercial use, please refer to https://stability.ai/license.

## Model Description

SDXL-Turbo is a distilled version of SDXL 1.0, trained for real-time synthesis. SDXL-Turbo is based on a novel training method called Adversarial Diffusion Distillation (ADD) (see the technical report), which allows sampling large-scale foundational image diffusion models in 1 to 4 steps at high image quality. This approach uses score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal and combines it with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps.

- Developed by: Stability AI
- Funded by: Stability AI
- Model type: Generative text-to-image model
- Finetuned from model: SDXL 1.0 Base

For research purposes, we recommend our `generative-models` GitHub repository (https://github.com/Stability-AI/generative-models), which implements the most popular diffusion frameworks (both training and inference).

- Repository: https://github.com/Stability-AI/generative-models
- Paper: https://stability.ai/research/adversarial-diffusion-distillation
- Demo: http://clipdrop.co/stable-diffusion-turbo

The charts above evaluate user preference for SDXL-Turbo over other single- and multi-step models. SDXL-Turbo evaluated at a single step is preferred by human voters in terms of image quality and prompt following over LCM-XL evaluated at four (or fewer) steps. In addition, we see that using four steps for SDXL-Turbo further improves performance. For details on the user study, we refer to the research paper.

The model is intended for both non-commercial and commercial usage. You can use this model for non-commercial or research purposes under this license. Possible research areas and tasks include:

- Research on generative models.
- Research on real-time applications of generative models.
- Research on the impact of real-time generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.

For commercial use, please refer to https://stability.ai/membership.

SDXL-Turbo does not make use of `guidance_scale` or `negative_prompt`; we disable them with `guidance_scale=0.0`. Preferably, the model generates images of size 512x512, but higher image sizes work as well. A single step is enough to generate high-quality images.

When using SDXL-Turbo for image-to-image generation, make sure that `num_inference_steps * strength` is larger than or equal to 1. The image-to-image pipeline will run for `int(num_inference_steps * strength)` steps, e.g. 0.5 * 2.0 = 1 step in our example below.

The model was not trained to produce factual or true representations of people or events, and therefore using the model to generate such content is out of scope for its abilities. The model should not be used in any way that violates Stability AI's Acceptable Use Policy.

## Limitations

- The generated images are of a fixed resolution (512x512 px), and the model does not achieve perfect photorealism.
- The model cannot render legible text.
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.

The model is intended for both non-commercial and commercial usage. Check out https://github.com/Stability-AI/generative-models
whisper-large-v3-turbo-MLX
Run them directly with nexa-sdk installed. In the nexa-sdk CLI (a sketch is given at the end of this card):

## Overview

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it is the exact same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation. You can find more details about it in this GitHub discussion.

## Reference

Original model card: openai/whisper-large-v3-turbo
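A sketch of the CLI step, assuming the `nexa infer` subcommand and this repository id (exact syntax may vary by NexaSDK version):

```bash
# local transcription on Apple Silicon; drag an audio file into the terminal
# or type its path when prompted
nexa infer NexaAI/whisper-large-v3-turbo-MLX
```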
granite-4.0-micro-GGUF
DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant
DeepSeek-OCR-GGUF-CUDA
> [!NOTE]
> Currently, only NexaSDK supports this model's GGUF.

1. Install NexaSDK
2. Run the model locally with one line of code (a sketch is given at the end of this card)
3. Then drag your image into the terminal, or type the image path

## Model Description

DeepSeek OCR is a high-accuracy optical character recognition model built for extracting text from complex visual inputs such as documents, screenshots, receipts, and natural scenes. It combines vision-language modeling with efficient visual encoders to achieve superior recognition of multi-language and multi-layout text while remaining lightweight enough for edge or on-device deployment.

## Features

- Multilingual OCR: recognizes printed and handwritten text across major global languages.
- Document Layout Understanding: preserves structure such as tables, paragraphs, and titles.
- Scene Text Recognition: robust against lighting, distortion, and low-quality captures.
- Lightweight & Fast: optimized for CPU and GPU acceleration.
- End-to-End Pipeline: supports image-to-text and structured JSON output.

## Use Cases

- Digitizing scanned documents or PDFs
- Extracting text from mobile camera inputs or screenshots
- Invoice and receipt parsing
- OCR-based search and indexing systems
- Visual question answering or document agents

## Inputs and Outputs

Input:
- Image file (JPEG, PNG, or tensor array)
- Optional parameters for language hints or layout detection

Output:
- Extracted text (plain text or structured format with bounding boxes)
- Confidence scores per word or region

## Integration

DeepSeek OCR can be integrated through:
- Python API (`pip install deepseek-ocr`)
- REST or gRPC endpoints for server deployment

## License

This model is released under the Apache 2.0 License, allowing commercial use, modification, and redistribution with attribution.
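A sketch of steps 2-3, assuming the `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax on CUDA-enabled machines):

```bash
# start OCR inference on the GPU; when prompted, drag an image into the
# terminal or type its path, then add an optional instruction
nexa infer NexaAI/DeepSeek-OCR-GGUF-CUDA
```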
qwen3vl-4B-Instruct-4bit-mlx
Qwen3-VL-4B-Instruct

Run Qwen3-VL-4B-Instruct optimized for Apple Silicon on MLX with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code (a sketch is given at the end of this card).

## Model Description

Qwen3-VL-4B-Instruct is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud's Qwen team. As part of the Qwen3-VL series, it fuses powerful vision-language understanding with conversational fine-tuning, optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue. The Instruct variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.

## Features

- Instruction-Following: Optimized for dialogue, explanation, and user-friendly task completion.
- Vision-Language Fusion: Understands and reasons across text, images, and video frames.
- Multilingual Capability: Handles multiple languages for diverse global use cases.
- Contextual Coherence: Balances reasoning ability with a natural, grounded conversational tone.
- Lightweight & Deployable: 4B parameters make it efficient for edge and device-level inference.

## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering

## Inputs and Outputs

Input:
- Text prompts, image(s), or mixed multimodal instructions.

Output:
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt.

## License

Refer to the official Qwen license for terms of use and redistribution.
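A sketch of step 2 on Apple Silicon, assuming the `nexa infer` subcommand and this repository id (check sdk.nexa.ai for the exact syntax):

```bash
# interactive multimodal chat with the 4-bit MLX build
nexa infer NexaAI/qwen3vl-4B-Instruct-4bit-mlx
```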
octo-planner-gguf
qwen3vl-4B-Instruct-fp16-mlx
Qwen3-VL-4B-Instruct

Run Qwen3-VL-4B-Instruct optimized for Apple Silicon on MLX with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code.

## Model Description

Qwen3-VL-4B-Instruct is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud's Qwen team. As part of the Qwen3-VL series, it fuses powerful vision-language understanding with conversational fine-tuning, optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue. The Instruct variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.

## Features

- Instruction-Following: Optimized for dialogue, explanation, and user-friendly task completion.
- Vision-Language Fusion: Understands and reasons across text, images, and video frames.
- Multilingual Capability: Handles multiple languages for diverse global use cases.
- Contextual Coherence: Balances reasoning ability with a natural, grounded conversational tone.
- Lightweight & Deployable: 4B parameters make it efficient for edge and device-level inference.

## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering

## Inputs and Outputs

Input:
- Text prompts, image(s), or mixed multimodal instructions.

Output:
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt.

## License

Refer to the official Qwen license for terms of use and redistribution.
Qwen3-VL-4B-Instruct-NPU
Qwen3-VL-4B-Instruct

Run Qwen3-VL-4B-Instruct optimized for Qualcomm NPUs with NexaSDK.

1. Install NexaSDK and create a free account at sdk.nexa.ai
2. Activate your device with your access token (a sketch is given at the end of this card).

## Model Description

Qwen3-VL-4B-Instruct is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud's Qwen team. As part of the Qwen3-VL series, it fuses powerful vision-language understanding with conversational fine-tuning, optimized for real-world applications such as chat-based reasoning, document analysis, and visual dialogue. The Instruct variant is tuned to follow user prompts naturally and safely, producing concise, relevant, and user-aligned responses across text, image, and video contexts.

## Features

- Instruction-Following: Optimized for dialogue, explanation, and user-friendly task completion.
- Vision-Language Fusion: Understands and reasons across text, images, and video frames.
- Multilingual Capability: Handles multiple languages for diverse global use cases.
- Contextual Coherence: Balances reasoning ability with a natural, grounded conversational tone.
- Lightweight & Deployable: 4B parameters make it efficient for edge and device-level inference.

## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering

## Inputs and Outputs

Input:
- Text prompts, image(s), or mixed multimodal instructions.

Output:
- Natural-language responses or visual reasoning explanations.
- Can return structured text (summaries, captions, answers, etc.) depending on the prompt.

## License

Refer to the official Qwen license for terms of use and redistribution.
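A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id; the exact syntax is documented at sdk.nexa.ai:

```bash
# activate this device with the access token from your sdk.nexa.ai account (assumed command)
nexa config set license '<your-access-token>'

# run the NPU build interactively (assumed repo id)
nexa infer NexaAI/Qwen3-VL-4B-Instruct-NPU
```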
jina-v2-rerank-npu
Run Jina Reranker v2 optimized for Qualcomm NPUs with NexaSDK.

1. Install NexaSDK and create a free account at sdk.nexa.ai
2. Activate your device with your access token (a sketch is given at the end of this card).

## Description

Jina Reranker v2 Base Multilingual is a multilingual cross-encoder model for document reranking. Given a query-document pair, it outputs a relevance score to improve ranking in retrieval systems.

## Features

- Cross-encoder architecture for fine-grained relevance scoring
- Supports multilingual inputs
- Handles inputs up to 1024 tokens using sliding-window chunking
- Employs flash attention optimizations

## Use Cases

- Reranking candidate passages in multilingual search
- Enhancing retrieval in QA / RAG pipelines
- Improving semantic relevance in recommendation systems

## Inputs & Outputs

- Input: Query & document (text pair)
- Output: Scalar relevance score (for ranking)

## License

This model is licensed under CC BY-NC 4.0, intended for research and evaluation use. Commercial use requires a separate arrangement.

## References

- Model page on Hugging Face
- Jina AI documentation / model site
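A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id (all assumed; see sdk.nexa.ai for the exact reranker workflow):

```bash
# activate this device with your sdk.nexa.ai access token (assumed command)
nexa config set license '<your-access-token>'

# load the reranker on the Qualcomm NPU; supply query/document pairs as prompted
nexa infer NexaAI/jina-v2-rerank-npu
```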
deepSeek-r1-distill-qwen-1.5B-intel-npu
Run DeepSeek-R1-Distill-Qwen-1.5B optimized for Intel NPUs with NexaSDK.

1. Install NexaSDK and create a free account at sdk.nexa.ai
2. Activate your device with your access token (a sketch is given at the end of this card).

## Model Description

DeepSeek-R1-Distill-Qwen-1.5B is a distilled variant of DeepSeek-R1, built on the Qwen 1.5B architecture. It compresses the reasoning and instruction-following capabilities of larger DeepSeek models into an ultra-lightweight 1.5B-parameter model, ideal for fast, efficient deployment on constrained devices while retaining strong performance for its size.

## Features

- Distilled from DeepSeek-R1: Maintains core reasoning and comprehension strengths in a smaller model.
- Instruction-tuned: Optimized for Q&A, task completion, and logical reasoning.
- Compact footprint: 1.5B parameters enable deployment in edge and mobile contexts.
- Multilingual support: Handles a wide range of global languages with efficiency.

## Use Cases

- Lightweight conversational agents and personal assistants
- Coding help and small-scale algorithmic reasoning
- Multilingual Q&A or translation in resource-limited environments
- Edge, mobile, and offline applications where compute or memory is limited

## Inputs and Outputs

Input: Text prompts including natural language queries, tasks, or code snippets.

Output: Direct responses (answers, explanations, or code) without extra reasoning annotations.

## References

- Model card: https://huggingface.co/deepseek-ai/deepseek-r1-distill-qwen-1.5b
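A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id; the exact syntax is documented at sdk.nexa.ai:

```bash
# activate this device with your sdk.nexa.ai access token (assumed command)
nexa config set license '<your-access-token>'

# chat with the distilled 1.5B model on the Intel NPU (assumed repo id)
nexa infer NexaAI/deepSeek-r1-distill-qwen-1.5B-intel-npu
```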
OmniNeural-4B
OmniNeural: World's First NPU-aware Multimodal Model

## Overview

OmniNeural is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands text, images, and audio, and runs across PCs, mobile devices, automobiles, IoT, and robotics.

📱 Mobile Phone NPU - Demo on Samsung S25 Ultra: the first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running natively on the Snapdragon NPU for long battery life and low latency.

🖼️ Multi-Image Reasoning: spot the difference across two images in multi-round dialogue.

🤖 Image + Text → Function Call: snap a poster, add a text instruction, and the AI agent creates a calendar event.

🎶 Multi-Audio Comparison: tell the difference between two music clips locally.

## Key Features

- Multimodal Intelligence: processes text, image, and audio in a unified model for richer reasoning and perception.
- NPU-Optimized Architecture: uses ReLU ops, sparse tensors, convolutional layers, and static graph execution for maximum throughput, 20% faster than non-NPU-aware models.
- Hardware-Aware Attention: attention patterns tuned for NPUs, lowering compute and memory demand.
- Native Static Graph: supports variable-length multimodal inputs with stable, predictable latency.
- Performance Gains: 9× faster audio processing and 3.5× faster image processing on NPUs compared to baseline encoders.
- Privacy-First Inference: all computation stays local, so it is private, offline-capable, and cost-efficient.

## Performance / Benchmarks

Human Evaluation (vs. baselines):
- Vision: wins/ties in ~75% of prompts against Apple Foundation, Gemma-3n-E4B, and Qwen2.5-Omni-3B.
- Audio: clear lead over baselines, much better than Gemma-3n and the Apple foundation model.
- Text: matches or outperforms leading multimodal baselines.

Nexa Attention Speedups:
- 9× faster audio encoding (vs. the Whisper encoder).
- 3.5× faster image encoding (vs. the SigLIP encoder).

## Architecture Overview

OmniNeural's design is tightly coupled with NPU hardware:

- NPU-friendly ops (ReLU > GELU/SiLU).
- Sparse and small tensor multiplications for efficiency.
- Convolutional layers favored over linear layers for better NPU parallelization.
- Hardware-aware attention patterns to cut compute cost.
- Static graph execution for predictable latency.

Application scenarios:

- PC & Mobile: on-device AI agents combine voice, vision, and text for natural, accurate responses.
  - Examples: summarize slides into an email (PC), extract action items from chat (mobile).
  - Benefits: private, offline, battery-efficient.
- Automotive: in-car assistants handle voice control, cabin safety, and environment awareness.
  - Examples: detects risks (child unbuckled, pet left, loose objects) and road conditions (fog, construction).
  - Benefits: decisions run locally in milliseconds.
- IoT & Robotics: multimodal sensing for factories, AR/VR, drones, and robots.
  - Examples: defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction.
  - Benefits: works without network connectivity.

> ⚠️ Hardware requirement: OmniNeural-4B currently runs only on Qualcomm NPUs (e.g., Snapdragon-powered AIPC).
> Apple NPU support is planned next.

To deploy:

1. Download and follow the steps under the "Deploy" section on Nexa's model page: download the Windows arm64 SDK (other platforms coming soon).
2. Get an access token: create a token in the Model Hub, then log in.

Once the model is running, you can type /mic mode to record your voice directly in the terminal. For images and audio, simply drag your files into the command line.
Remember to leave a space between file paths.

- Issues / Feedback: use the HF Discussions tab, or submit an issue in our Discord or the nexa-sdk GitHub.
- Roadmap & updates: follow us on X and Discord.

> If you want to see more NPU-first, multimodal releases on HF, please give our model a like ❤️.

## Limitation

The current model is mainly optimized for English. We will optimize other languages as the next step.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].
qwen3vl-4B-Thinking-4bit-mlx
Qwen3-VL-4B-Thinking

Run Qwen3-VL-4B-Thinking optimized for Apple Silicon on MLX with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code.

## Model Description

Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal large language model from the Qwen team at Alibaba Cloud. Part of the Qwen3-VL (Vision-Language) family, it is designed for advanced visual reasoning and chain-of-thought generation across image, text, and video inputs. Compared to the Instruct variant, the Thinking model emphasizes deeper multi-step reasoning, analysis, and planning. It produces detailed, structured outputs that reflect intermediate reasoning steps, making it well-suited for research, multimodal understanding, and agentic workflows.

## Features

- Vision-Language Understanding: Processes images, text, and videos for joint reasoning tasks.
- Structured Thinking Mode: Generates intermediate reasoning traces for better transparency and interpretability.
- High Accuracy on Visual QA: Performs strongly on visual question answering, chart reasoning, and document analysis benchmarks.
- Multilingual Support: Understands and responds in multiple languages.
- Optimized for Efficiency: Delivers strong performance at 4B scale for on-device or edge deployment.

## Use Cases

- Multimodal reasoning and visual question answering
- Scientific and analytical reasoning tasks involving charts, tables, and documents
- Step-by-step visual explanation or tutoring
- Research on interpretability and chain-of-thought modeling
- Integration into agent systems that require structured reasoning

## Inputs and Outputs

Input:
- Text, images, or combined multimodal prompts (e.g., image + question)

Output:
- Generated text, reasoning traces, or structured responses
- May include explicit thought steps or structured JSON reasoning sequences

## License

Check the official Qwen license for terms of use and redistribution.
Llama3.2-3B-NPU-Turbo
Llama3.2-3B

Run Llama3.2-3B optimized for Qualcomm NPUs with NexaSDK.

1. Install NexaSDK and create a free account at sdk.nexa.ai
2. Activate your device with your access token (a sketch is given at the end of this card).

## Model Description

Llama3.2-3B is a 3-billion-parameter language model from Meta's Llama 3.2 series. It is designed to provide a balance of efficiency and capability, making it suitable for deployment on a wide range of devices while maintaining strong performance on core language understanding and generation tasks. Trained on diverse, high-quality datasets, Llama3.2-3B supports multiple languages and is optimized for scalability, fine-tuning, and real-world applications.

## Features

- Lightweight yet capable: delivers strong performance with a smaller memory footprint.
- Conversational AI: context-aware dialogue for assistants and agents.
- Content generation: text completion, summarization, code comments, and more.
- Reasoning & analysis: step-by-step problem solving and explanation.
- Multilingual: supports understanding and generation in multiple languages.
- Customizable: can be fine-tuned for domain-specific or enterprise use.

## Use Cases

- Personal and enterprise chatbots
- On-device AI applications
- Document and report summarization
- Education and tutoring tools
- Specialized models in verticals (e.g., healthcare, finance, legal)

## Inputs and Outputs

Input:
- Text prompts or conversation history (tokenized input sequences).

Output:
- Generated text: responses, explanations, or creative content.
- Optionally: raw logits/probabilities for advanced downstream tasks.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].

## References

- Meta AI Llama Models
- Hugging Face Model Card
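A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id; the exact syntax is documented at sdk.nexa.ai:

```bash
# activate this device with your sdk.nexa.ai access token (assumed command)
nexa config set license '<your-access-token>'

# chat with the Llama 3.2 3B NPU Turbo build (assumed repo id)
nexa infer NexaAI/Llama3.2-3B-NPU-Turbo
```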
Kokoro-82M-bf16-MLX
gpt-oss-20b-MLX-4bit
qwen3vl-4B-Thinking-fp16-mlx
Qwen3-VL-4B-Thinking

Run Qwen3-VL-4B-Thinking optimized for Apple Silicon on MLX with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code.

## Model Description

Qwen3-VL-4B-Thinking is a 4-billion-parameter multimodal large language model from the Qwen team at Alibaba Cloud. Part of the Qwen3-VL (Vision-Language) family, it is designed for advanced visual reasoning and chain-of-thought generation across image, text, and video inputs. Compared to the Instruct variant, the Thinking model emphasizes deeper multi-step reasoning, analysis, and planning. It produces detailed, structured outputs that reflect intermediate reasoning steps, making it well-suited for research, multimodal understanding, and agentic workflows.

## Features

- Vision-Language Understanding: Processes images, text, and videos for joint reasoning tasks.
- Structured Thinking Mode: Generates intermediate reasoning traces for better transparency and interpretability.
- High Accuracy on Visual QA: Performs strongly on visual question answering, chart reasoning, and document analysis benchmarks.
- Multilingual Support: Understands and responds in multiple languages.
- Optimized for Efficiency: Delivers strong performance at 4B scale for on-device or edge deployment.

## Use Cases

- Multimodal reasoning and visual question answering
- Scientific and analytical reasoning tasks involving charts, tables, and documents
- Step-by-step visual explanation or tutoring
- Research on interpretability and chain-of-thought modeling
- Integration into agent systems that require structured reasoning

## Inputs and Outputs

Input:
- Text, images, or combined multimodal prompts (e.g., image + question)

Output:
- Generated text, reasoning traces, or structured responses
- May include explicit thought steps or structured JSON reasoning sequences

## License

Check the official Qwen license for terms of use and redistribution.
DeepSeek-R1-Distill-Llama-8B-NexaQuant
embeddinggemma-300m-npu
## Model Description

EmbeddingGemma is a 300M-parameter open embedding model developed by Google DeepMind. It is built from Gemma 3 (with T5Gemma initialization) and the same research and technology used in Gemini models. The model produces vector representations of text, making it well-suited for search, retrieval, classification, clustering, and semantic similarity tasks. It was trained on 100+ languages with ~320B tokens and optimized for on-device efficiency (mobile, laptops, desktops).

## Features

- Compact and efficient: 300M parameters, optimized for on-device use.
- Multilingual: trained on 100+ spoken languages.
- Flexible embeddings: default dimension 768, with support for 512, 256, and 128 via Matryoshka Representation Learning (MRL).
- Wide task coverage: retrieval, QA, fact-checking, classification, clustering, similarity.
- Commercial-friendly: open weights available for research and production.

## Use Cases

- Semantic similarity and recommendation systems
- Document, code, and web search
- Clustering for organization, research, and anomaly detection
- Classification (e.g., sentiment, spam detection)
- Fact verification and QA embeddings
- Code retrieval for programming assistance

## Inputs and Outputs

Input:
- Type: Text string (e.g., query, prompt, document)
- Max Length: 2048 tokens

Output:
- Type: Embedding vector (default 768-d)
- Options: 512 / 256 / 128 dimensions via truncation & re-normalization (MRL)

## Limitations & Responsible Use

This model has known limitations:

- Bias & coverage: quality depends on training data diversity.
- Nuance & ambiguity: may struggle with sarcasm and figurative language.
- Ethical concerns: risk of bias perpetuation, privacy leakage, or malicious misuse.

Mitigations:

- CSAM and sensitive-data filtering applied.
- Users should adhere to the Gemma Responsible AI guidelines and Prohibited Use Policy.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].

## Support

For SDK-related issues, visit sdk.nexa.ai. For model-specific questions, open an issue in this repository.
qwen3vl-8B-Instruct-4bit-mlx
Qwen3-VL-8B-Instruct

Run Qwen3-VL-8B-Instruct optimized for Apple Silicon on MLX with NexaSDK.

1. Install NexaSDK
2. Run the model locally with one line of code.

## Model Description

Qwen3-VL-8B-Instruct is an 8-billion-parameter instruction-tuned multimodal large language model developed by the Qwen team at Alibaba Cloud. It belongs to the Qwen3-VL series, designed for seamless understanding and reasoning across text, image, and video. This version combines the visual intelligence of Qwen3-VL with the instruction-following capabilities of Qwen3-LM, enabling natural, grounded conversations around complex visual content. Compared to the 4B variant, the 8B model delivers stronger reasoning, richer context retention, and improved performance on visual and multilingual benchmarks while maintaining efficiency for deployment.

## Features

- Enhanced Visual Understanding: Handles complex scenes, documents, and multi-image inputs.
- Instruction-Tuned Dialogue: Produces coherent and context-aware responses aligned with user intent.
- Multilingual Support: Capable of understanding and generating in multiple languages.
- Extended Context Window: Supports longer text and multimodal contexts for better reasoning continuity.
- Optimized Performance: Balances large-scale reasoning capability with deployability for high-end edge or server environments.

## Use Cases

- Visual chatbots and multimodal assistants
- Document and chart interpretation
- Image-grounded content generation and summarization
- Video frame reasoning and analysis
- Multilingual multimodal tutoring or knowledge assistants

## Inputs and Outputs

Input:
- Text, images, or combined multimodal prompts
- Optional video frames or sequential image sets

Output:
- Natural-language answers, summaries, captions, or structured reasoning outputs
- Can provide visual explanations or reasoning narratives when prompted

## License

See the official Qwen license for details on usage and redistribution.
Prefect-illustrious-XL-v2.0p
Granite-4.0-h-350M-NPU
Run Granite-4.0-h-350M optimized for Qualcomm Hexagon NPUs with NexaSDK.

1. Install NexaSDK
2. Activate your device with your access token for free at sdk.nexa.ai (a sketch is given at the end of this card)

## Model Description

Granite-4.0-h-350M is a 350-million-parameter transformer model from IBM's Granite 4.0 family, designed for efficient inference, low-latency edge deployment, and instruction following at compact scale. It shares the same data quality, architecture design, and alignment pipeline as larger Granite 4.0 models but is optimized for lightweight environments where performance per watt and model size are critical. Built on the Granite 4.0 foundation, this model continues IBM's commitment to open, responsible AI, offering transparency and adaptability for developers, researchers, and embedded AI applications.

## Features

- Compact yet capable: Delivers high-quality generation and reasoning with just 350M parameters.
- Instruction-tuned: Follows natural language instructions for diverse tasks.
- Low-latency performance: Ideal for CPU, GPU, and NPU inference.
- Efficient deployment: Runs smoothly on edge and resource-constrained devices.
- Open and transparent: Released under IBM's open model governance framework.

## Use Cases

- On-device assistants and chatbots
- Edge AI and IoT inference
- Document and text summarization
- Education and lightweight reasoning tasks
- Prototype fine-tuning for domain adaptation

## Inputs and Outputs

Input:
- Text prompt (instruction or question)

Output:
- Generated text response completing or following the input prompt

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].
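A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id; the exact syntax is documented at sdk.nexa.ai:

```bash
# activate this device with your sdk.nexa.ai access token (assumed command)
nexa config set license '<your-access-token>'

# run the Granite 350M build on the Hexagon NPU (assumed repo id)
nexa infer NexaAI/Granite-4.0-h-350M-NPU
```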
jan-v1-4B-npu
phi4-mini-npu-turbo
Run Phi-4-mini optimized for Qualcomm NPUs with NexaSDK.

1. Install NexaSDK and create a free account at sdk.nexa.ai
2. Activate your device with your access token (a sketch is given at the end of this card).

Phi-4-mini is a ~3.8B-parameter instruction-tuned model from Microsoft's Phi-4 family. Trained on a blend of synthetic "textbook-style" data, filtered public web content, curated books/Q&A, and high-quality supervised chat data, it emphasizes reasoning-dense capabilities while maintaining a compact footprint. This NPU Turbo build uses Nexa's Qualcomm backend (QNN/Hexagon) to deliver lower latency and higher throughput on-device, with support for 128K context and efficient long-context memory handling.

- Lightweight yet capable: strong reasoning (math/logic) in a compact 3.8B model.
- Instruction-following: enhanced SFT + DPO alignment for reliable chat.
- Content generation: drafting, completion, summarization, code comments, and more.
- Conversational AI: context-aware assistants/agents with long-context support (128K).
- NPU-Turbo path: INT8/INT4 quantization, op fusion, and KV-cache residency for Snapdragon® NPUs via NexaSDK.
- Customizable: fine-tune/adapt for domain-specific or enterprise use.

Use cases:
- Personal & enterprise chatbots
- On-device/offline assistants (latency-bound scenarios)
- Document/report/email summarization
- Education, tutoring, and STEM reasoning tools
- Vertical applications (e.g., healthcare, finance, legal) with appropriate safeguards

Input: text prompts or conversation history (chat-format, tokenized sequences).

Output:
- Generated text: responses, explanations, or creative content.
- Optionally: raw logits/probabilities for advanced downstream tasks.

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].

## References

- 📰 Phi-4-mini Microsoft Blog
- 📖 Phi-4-mini Technical Report
- 👩🍳 Phi Cookbook
- 🚀 Model paper
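A sketch of the activation and run steps, assuming the `nexa config set license` and `nexa infer` commands and this repository id; the exact syntax is documented at sdk.nexa.ai:

```bash
# activate this device with your sdk.nexa.ai access token (assumed command)
nexa config set license '<your-access-token>'

# chat with the Phi-4-mini NPU Turbo build (assumed repo id)
nexa infer NexaAI/phi4-mini-npu-turbo
```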
parakeet-tdt-0.6b-v3-npu
Model Description parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual automatic speech recognition (ASR) model developed by NVIDIA. It extends the parakeet-tdt-0.6b-v2 model by expanding from English-only support to 25 European languages. The model automatically detects the spoken language and transcribes speech to text without requiring additional prompting. It was trained primarily on the Granary multilingual corpus [1,2] and is optimized for both research and production-grade deployment. Features - Multilingual ASR: Supports 25 European languages with automatic language detection. - Automatic punctuation & capitalization included. - Timestamps: Accurate word-level and segment-level timestamps. - Long audio support: - Up to 24 minutes with full attention (A100 80GB). - Up to 3 hours with local attention. - Permissively licensed upstream: NVIDIA’s original release uses the CC-BY-4.0 license (this NPU build is distributed under CC BY-NC 4.0; see License below). Supported Languages Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Ukrainian (uk) Use Cases - Conversational AI and multilingual chatbots - Voice assistants - Transcription services - Subtitles and caption generation - Voice analytics platforms - Academic and industry research on speech technologies Inputs and Outputs Input: - Type: 16kHz audio - Formats: `.wav`, `.flac` - Shape: 1D audio signal (mono channel) Output: - Type: Text string - Properties: Includes punctuation and capitalization This model may produce transcription or generation errors. Evaluate carefully in production, especially in sensitive domains (healthcare, legal, finance). License This model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected]. For SDK-related issues, visit sdk.nexa.ai. For model-specific questions, open an issue in this repository.
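The Inputs section above specifies 16kHz mono audio in `.wav` or `.flac`. A minimal preprocessing sketch for getting arbitrary recordings into that shape before transcription, using the widely available librosa and soundfile packages (file names are illustrative):

```python
# Resample and downmix an arbitrary recording to 16 kHz mono, the input
# format this model card specifies. The ASR call itself is runtime-specific
# and not shown here.
import librosa
import soundfile as sf

audio, sr = librosa.load("meeting_recording.flac", sr=16000, mono=True)  # resample + downmix to mono
sf.write("meeting_recording_16k.wav", audio, 16000)                      # 1-D signal at 16 kHz
print(f"{len(audio) / 16000:.1f} seconds of audio ready for transcription")
```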
qwen3vl-8B-Instruct-fp16-mlx
Qwen3-VL-8B-Instruct Run Qwen3-VL-8B-Instruct optimized for Apple Silicon on MLX with NexaSDK. 1. Install NexaSDK 2. Run the model locally with one line of code: Model Description Qwen3-VL-8B-Instruct is an 8-billion-parameter instruction-tuned multimodal large language model developed by the Qwen team at Alibaba Cloud. It belongs to the Qwen3-VL series, designed for seamless understanding and reasoning across text, image, and video. This version combines the visual intelligence of Qwen3-VL with the instruction-following capabilities of Qwen3-LM, enabling natural, grounded conversations around complex visual content. Compared to the 4B variant, the 8B model delivers stronger reasoning, richer context retention, and improved performance on visual and multilingual benchmarks while maintaining efficiency for deployment. Features - Enhanced Visual Understanding: Handles complex scenes, documents, and multi-image inputs. - Instruction-Tuned Dialogue: Produces coherent and context-aware responses aligned with user intent. - Multilingual Support: Capable of understanding and generating in multiple languages. - Extended Context Window: Supports longer text and multimodal contexts for better reasoning continuity. - Optimized Performance: Balances large-scale reasoning capability with deployability for high-end edge or server environments. Use Cases - Visual chatbots and multimodal assistants - Document and chart interpretation - Image-grounded content generation and summarization - Video frame reasoning and analysis - Multilingual multimodal tutoring or knowledge assistants Inputs and Outputs Input: - Text, images, or combined multimodal prompts - Optional video frames or sequential image sets Output: - Natural-language answers, summaries, captions, or structured reasoning outputs - Can provide visual explanations or reasoning narratives when prompted License See the official Qwen license for details on usage and redistribution.
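As a concrete illustration of the combined multimodal prompts listed above, the sketch below expresses an image-plus-question turn as structured data. The field names follow a common content-parts convention and are hypothetical; the exact request format expected by the MLX runtime may differ:

```python
# Illustrative multimodal chat turn: an image plus a grounded question.
# Field names are hypothetical, not a confirmed NexaSDK/MLX schema.
multimodal_turn = {
    "role": "user",
    "content": [
        {"type": "image", "path": "quarterly_revenue_chart.png"},  # local image to reason over
        {"type": "text", "text": "Summarize the trend in this chart and flag any anomalies."},
    ],
}
print(multimodal_turn)
```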
llama3.2-1B-intel-npu
Run Llama-3.2-1B optimized for Intel NPUs with nexaSDK. 1. Install nexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description Llama-3.2-1B is the smallest model in the Llama 3.2 family, optimized for efficiency and ultra-lightweight deployment. With just 1B parameters, it enables fast inference on resource-constrained environments while retaining strong instruction-following and multilingual capabilities for its size. Features - Ultra-compact design: 1B parameters for minimal memory and compute requirements. - Instruction-tuned: Capable of following prompts and answering questions reliably. - Multilingual support: Handles a wide set of languages despite small scale. - Edge-ready: Runs efficiently on laptops, mobile devices, and other constrained hardware. Use Cases - On-device conversational agents and personal assistants. - Educational apps or lightweight tutoring systems. - Prototyping with LLMs in environments where compute or cost is heavily constrained. - Offline or embedded applications where larger models are impractical. Inputs and Outputs Input: Text prompts such as questions, instructions, or code snippets. Output: Concise natural language responses, answers, or explanations. License - Licensed under Meta Llama 3.2 Community License References - Model card: https://huggingface.co/meta-llama/Llama-3.2-1B
gemma-3n-E4B-it-4bit-MLX
deepSeek-r1-distill-qwen-7B-intel-npu
Run DeepSeek-r1-distill-qwen-7B optimized for Intel NPUs with nexaSDK. 1. Install nexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description deepSeek-r1-distill-qwen-7B is a distilled variant of DeepSeek-R1, built on the Qwen-7B architecture. It is designed for efficient reasoning and instruction-following while maintaining strong performance across coding, logic, and multilingual tasks. Distillation compresses the capabilities of larger DeepSeek models into a lighter 7B parameter model, making it more practical for edge deployment and resource-constrained environments. Features - Distilled from DeepSeek-R1: Retains core reasoning strengths in a smaller, faster footprint. - Instruction-tuned: Optimized for comprehension, logic, and task completion. - Multilingual coverage: Handles diverse language inputs with improved efficiency. - Compact yet capable: Balances performance with deployability on a wide range of hardware. Use Cases - Conversational AI and instruction-following assistants. - Coding support, debugging, and algorithmic reasoning. - Multilingual content generation and translation. - Lightweight deployment on edge or limited-resource devices. Inputs and Outputs Input: Text prompts including natural language queries, instructions, or code snippets. Output: Direct responses—answers, explanations, code, or translations—without extra reasoning annotations. References - Model card: https://huggingface.co/deepseek-ai/deepseek-r1-distill-qwen-7b
llama-3.1-8B-intel-npu
Llama-3.1-8B Run Llama-3.1-8B optimized for Intel NPUs with nexaSDK. 1. Install nexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description Llama-3.1-8B is a mid-sized model in the Llama 3.1 family, balancing strong reasoning and language understanding with efficient deployment. At 8B parameters, it offers significantly higher accuracy and fluency than smaller Llama models, while remaining practical for fine-tuning and inference on modern GPUs. Features - Balanced scale: 8B parameters provide a strong trade-off between performance and efficiency. - Instruction-tuned: Optimized for following prompts, Q&A, and detailed reasoning. - Multilingual capabilities: Broad support across global languages. - Developer-friendly: Available for fine-tuning, domain adaptation, and integration into custom applications. Use Cases - Conversational AI and digital assistants requiring stronger reasoning. - Content generation, summarization, and analysis. - Coding help and structured problem solving. - Research and prototyping in environments where very large models are impractical. Inputs and Outputs Input: Text prompts—questions, instructions, or code snippets. Output: Natural language responses including answers, explanations, structured outputs, or code. License - Licensed under Meta Llama 3.1 Community License References - Model card: https://huggingface.co/meta-llama/Llama-3.1-8B
parakeet-tdt-0.6b-v2-MLX
LFM2.5-1.2B-GGUF
qwen3vl-8B-Thinking-4bit-mlx
Qwen3-VL-8B-Thinking Run Qwen3-VL-8B-Thinking optimized for Apple Silicon on MLX with NexaSDK. 1. Install NexaSDK 2. Run the model locally with one line of code: Model Description Qwen3-VL-8B-Thinking is an 8-billion-parameter multimodal large language model from Alibaba Cloud’s Qwen team. As part of the Qwen3-VL (Vision-Language) family, it is designed for deep multimodal reasoning — combining visual understanding, long-context comprehension, and structured chain-of-thought generation across text, images, and videos. The Thinking variant focuses on advanced reasoning transparency and analytical precision. Compared to the Instruct version, it produces richer intermediate reasoning steps, enabling detailed explanation, planning, and multi-hop analysis across visual and textual inputs. Features - Deep Visual Reasoning: Interprets complex scenes, charts, and documents with multi-step logic. - Chain-of-Thought Generation: Produces structured reasoning traces for improved interpretability and insight. - Extended Context Handling: Maintains coherence across longer multimodal sequences. - Multilingual Competence: Understands and generates in multiple languages for global applicability. - High Accuracy at 8B Scale: Achieves strong benchmark performance in multimodal reasoning and analysis tasks. Use Cases - Research and analysis requiring visual reasoning transparency - Complex multimodal QA and scientific problem solving - Visual analytics and explanation generation - Advanced agent systems needing structured thought or planning steps - Educational tools requiring detailed, interpretable reasoning Inputs and Outputs Input: - Text, image(s), or multimodal combinations (including sequential frames or documents) - Optional context for multi-turn or multi-modal reasoning Output: - Structured reasoning outputs with intermediate steps - Detailed answers, explanations, or JSON-formatted reasoning traces License Refer to the official Qwen license for usage and redistribution details.
gpt-oss-20b-MLX-8bit
llama3.2-3B-intel-npu
Llama-3.2-3B Run Llama-3.2-3B optimized for Intel NPUs with nexaSDK. 1. Install nexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description Llama-3.2-3B is a compact member of the Llama 3.2 family, designed to provide strong general-purpose language modeling in a lightweight 3B parameter footprint. It balances efficiency with capability, making it well-suited for edge devices, prototyping, and applications where latency and resource constraints are critical. Features - Lightweight architecture: 3B parameters optimized for fast inference and low memory usage. - Instruction-following: Tuned for prompts, Q&A, and step-by-step reasoning. - Multilingual capabilities: Covers a wide range of global languages at smaller scale. - Deployment flexibility: Runs efficiently on consumer hardware and server environments. Use Cases - Conversational assistants and chatbots. - Educational tools and lightweight tutoring systems. - Prototyping and experimentation with large language models on limited resources. - Applications where cost or latency is a priority over sheer scale. Inputs and Outputs Input: Text prompts—questions, commands, or code snippets. Output: Natural language responses including answers, explanations, or structured outputs. License - Licensed under Meta Llama 3.2 Community License References - Model card: https://huggingface.co/meta-llama/Llama-3.2-3B
qwen3vl-8B-Thinking-fp16-mlx
Qwen3-VL-8B-Thinking Run Qwen3-VL-8B-Thinking optimized for Apple Silicon on MLX with NexaSDK. 1. Install NexaSDK 2. Run the model locally with one line of code: Model Description Qwen3-VL-8B-Thinking is an 8-billion-parameter multimodal large language model from Alibaba Cloud’s Qwen team. As part of the Qwen3-VL (Vision-Language) family, it is designed for deep multimodal reasoning — combining visual understanding, long-context comprehension, and structured chain-of-thought generation across text, images, and videos. The Thinking variant focuses on advanced reasoning transparency and analytical precision. Compared to the Instruct version, it produces richer intermediate reasoning steps, enabling detailed explanation, planning, and multi-hop analysis across visual and textual inputs. Features - Deep Visual Reasoning: Interprets complex scenes, charts, and documents with multi-step logic. - Chain-of-Thought Generation: Produces structured reasoning traces for improved interpretability and insight. - Extended Context Handling: Maintains coherence across longer multimodal sequences. - Multilingual Competence: Understands and generates in multiple languages for global applicability. - High Accuracy at 8B Scale: Achieves strong benchmark performance in multimodal reasoning and analysis tasks. Use Cases - Research and analysis requiring visual reasoning transparency - Complex multimodal QA and scientific problem solving - Visual analytics and explanation generation - Advanced agent systems needing structured thought or planning steps - Educational tools requiring detailed, interpretable reasoning Inputs and Outputs Input: - Text, image(s), or multimodal combinations (including sequential frames or documents) - Optional context for multi-turn or multi-modal reasoning Output: - Structured reasoning outputs with intermediate steps - Detailed answers, explanations, or JSON-formatted reasoning traces License Refer to the official Qwen license for usage and redistribution details.
Qwen3-4B-4bit-MLX
Granite-4-Micro-NPU
Granite-4.0-Micro Run Granite-4.0-Micro optimized for Qualcomm NPUs with nexaSDK. 1. Install NexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description Granite-4.0-Micro is a 3B parameter instruction-tuned model in the Granite 4.0 family, developed by IBM. It’s optimized for long-context reasoning (128K tokens), efficient inference, and enterprise-ready capabilities such as tool calling and retrieval-augmented generation. The model balances compact size with strong performance across general NLP tasks, making it suitable for both experimentation and production workloads. Features - Compact transformer architecture: 3B parameters with GQA, RoPE, SwiGLU, and RMSNorm layers. - Instruction-following & tool calling: Tuned with supervised finetuning, alignment (RLHF), and model merging for robust enterprise tasks. - Multilingual support: Covers 12+ languages including English, German, Spanish, French, Japanese, Korean, Arabic, and Chinese. - Extended context window: Supports sequences up to 128K tokens for long-form reasoning. Use Cases - Conversational AI and virtual assistants. - Enterprise applications needing tool/API calling and structured outputs. - Long-document summarization, classification, and extraction. - Retrieval-augmented generation (RAG) for knowledge-intensive workflows. - Lightweight coding assistants and multilingual dialog systems. Inputs and Outputs Input: Natural language text prompts, chat conversations, or tool-augmented requests. Output: Natural language responses—answers, explanations, summaries, structured JSON for function calls, or code snippets. License This model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].
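The tool-calling and structured-output capabilities above are easiest to see with a concrete round trip. The sketch below uses a generic JSON-schema tool definition; Granite's exact function-calling format is set by its chat template, so the field names and the `get_weather` tool are illustrative assumptions:

```python
# Illustrative tool-calling round trip with generic, JSON-schema style field
# names; not a confirmed Granite or NexaSDK request format.
import json

weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A structured function call the model could emit instead of free-form text:
model_tool_call = {"name": "get_weather", "arguments": {"city": "Zurich"}}

# The application runs the tool, then feeds the result back for the final answer.
tool_result = {"city": "Zurich", "temperature_c": 7, "conditions": "light rain"}
print(json.dumps({"call": model_tool_call, "result": tool_result}, indent=2))
```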
qwen3-4B-npu
phi3.5-mini-npu
Run Phi-3.5-Mini optimized for Qualcomm NPUs with NexaSDK. 1. Install NexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description Phi-3.5-Mini is a ~3.8B-parameter instruction-tuned language model from Microsoft’s Phi family. It’s designed to deliver strong reasoning and instruction-following quality within a compact footprint, making it ideal for on-device and latency-sensitive applications. This Turbo build uses Nexa’s Qualcomm NPU path for faster inference and higher throughput while preserving model quality. Features - Lightweight yet capable: strong performance with small memory and compute budgets. - Conversational AI: context-aware dialogue for assistants and agents. - Content generation: drafting, completion, summarization, code comments, and more. - Reasoning & analysis: math/logic step-by-step problem solving. - Multilingual: supports understanding and generation across multiple languages. - Customizable: fine-tune or apply adapters for domain-specific use. Use Cases - Personal and enterprise chatbots - On-device AI applications and offline assistants - Document/report/email summarization - Education and tutoring tools - Vertical solutions (e.g., healthcare, finance, legal), with proper guardrails Inputs and Outputs Input: - Text prompts or conversation history (tokenized input sequences). Output: - Generated text: responses, explanations, or creative content. - Optionally: raw logits/probabilities for advanced downstream tasks. References - Microsoft – Phi Models - Hugging Face Model Card (Phi-3.5-Mini-Instruct) - Phi-3 Technical Report (blog/overview)
sdxl-base
Qwen3-0.6B-ANE
Gemma3-1B-ANE
OmniNeural-4B-mobile
OmniNeural — World’s First NPU-aware Multimodal Model (Mobile Version) Overview OmniNeural is the first fully multimodal model designed specifically for Neural Processing Units (NPUs). It natively understands text, images, and audio, and runs across PCs, mobile devices, automobiles, IoT, and robotics. 📱 Mobile Phone NPU - Demo on Samsung S25 Ultra The first-ever fully local, multimodal, and conversational AI assistant that hears you and sees what you see, running natively on the Snapdragon NPU for long battery life and low latency. Key Features - Multimodal Intelligence – Processes text, image, and audio in a unified model for richer reasoning and perception. - NPU-Optimized Architecture – Uses ReLU ops, sparse tensors, convolutional layers, and static graph execution for maximum throughput — 20% faster than non-NPU-aware models. - Hardware-Aware Attention – Attention patterns tuned for NPU hardware, lowering compute and memory demand. - Native Static Graph – Supports variable-length multimodal inputs with stable, predictable latency. - Performance Gains – 9× faster audio processing and 3.5× faster image processing on NPUs compared to baseline encoders. - Privacy-First Inference – All computation stays local: private, offline-capable, and cost-efficient. Performance / Benchmarks Human Evaluation (vs baselines) - Vision: Wins/ties in ~75% of prompts against Apple Foundation, Gemma-3n-E4B, Qwen2.5-Omni-3B. - Audio: Clear lead over baselines, well ahead of Gemma-3n and the Apple foundation model. - Text: Matches or outperforms leading multimodal baselines. Nexa Attention Speedups - 9× faster audio encoding (vs Whisper encoder). - 3.5× faster image encoding (vs SigLIP encoder). Architecture Overview OmniNeural’s design is tightly coupled with NPU hardware: - NPU-friendly ops (ReLU preferred over GELU/SiLU). - Sparse + small tensor multiplications for efficiency. - Convolutional layers favored over linear for better NPU parallelization. - Hardware-aware attention patterns to cut compute cost. - Static graph execution for predictable latency. (An illustrative code sketch of these preferences follows this card.) Use Cases - PC & Mobile – On-device AI agents combine voice, vision, and text for natural, accurate responses. - Examples: Summarize slides into an email (PC), extract action items from chat (mobile). - Benefits: Private, offline, battery-efficient. - Automotive – In-car assistants handle voice control, cabin safety, and environment awareness. - Examples: Detects risks (child unbuckled, pet left, loose objects) and road conditions (fog, construction). - Benefits: Decisions run locally in milliseconds. - IoT & Robotics – Multimodal sensing for factories, AR/VR, drones, and robots. - Examples: Defect detection, technician overlays, hazard spotting mid-flight, natural robot interaction. - Benefits: Works without network connectivity. Note: this version is for mobile (Android) only; see the documentation for how to use it. - Issues / Feedback: Use the HF Discussions tab or submit an issue in our Discord or the nexa-sdk GitHub. - Roadmap & updates: Follow us on X and Discord. > If you want to see more NPU-first, multimodal releases on HF, please give our model a like ❤️. Limitation The current model is mainly optimized for English. We will optimize for other languages as a next step. License This model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].
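The architecture bullets above describe preferences (ReLU over GELU, convolution over wide linear layers, static shapes for graph capture) rather than code. Below is a toy PyTorch block that makes those preferences concrete; it is a conceptual sketch for intuition only, not OmniNeural's actual layers:

```python
# Toy block illustrating NPU-oriented design choices: ReLU activation,
# a convolution instead of a wide linear projection, and a fixed input
# shape so the graph can be traced statically. Not OmniNeural's real code.
import torch
import torch.nn as nn

class NpuFriendlyBlock(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # convs parallelize well on NPUs
        self.act = nn.ReLU()                                                 # cheaper than GELU/SiLU on most NPUs

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, seq_len)
        return self.act(self.conv(x))

block = NpuFriendlyBlock().eval()
example = torch.randn(1, 256, 128)         # fixed shape, as a static-graph runtime expects
traced = torch.jit.trace(block, example)   # static graph capture for predictable latency
print(traced(example).shape)               # torch.Size([1, 256, 128])
```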
yolov12-npu
Qwen3-1.7B-4bit-MLX
SmolVLM-500M-Instruct-8bit-MLX
Pyannote-NPU
gemma-3-4b-it-8bit-MLX
LFM2-1.2B-npu
LFM2-1.2B Run LFM2-1.2B on Qualcomm NPU with NexaSDK. 1. Install NexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description LFM2-1.2B is part of Liquid AI’s second-generation LFM2 family, designed specifically for on-device and edge AI deployment. With 1.2 billion parameters, it strikes a balance between compact size, strong reasoning, and efficient compute utilization—ideal for running on CPUs, GPUs, or NPUs. LFM2 introduces a hybrid Liquid architecture with multiplicative gates and short convolutions, enabling faster convergence and improved contextual reasoning. It demonstrates up to 3× faster training and 2× faster inference on CPU compared to Qwen3, while maintaining superior accuracy across multilingual and instruction-following benchmarks. Features - ⚡ Speed & Efficiency – 2× faster inference and prefill on CPU compared to Qwen3. - 🧠 Hybrid Liquid Architecture – Combines multiplicative gating with convolutional layers for better reasoning and token reuse (a conceptual sketch follows this card). - 🌍 Multilingual Competence – Supports diverse languages for global use cases. - 🛠 Flexible Deployment – Runs efficiently on CPU, GPU, and NPU hardware. - 📈 Benchmark Performance – Outperforms similarly-sized models in math, knowledge, and reasoning tasks. Use Cases - Edge AI assistants and voice agents - Offline reasoning and summarization on mobile or automotive devices - Local code and text generation tools - Lightweight multimodal or RAG pipelines - Domain-specific fine-tuning for vertical applications (e.g., finance, robotics) Inputs and Outputs Input - Text prompts or structured instructions (tokenized sequences for API use). Output - Natural-language or structured text generations. - Optionally: logits or embeddings for advanced downstream integration. License This model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license. Non-commercial use, modification, and redistribution are permitted with attribution. For commercial licensing, please contact [email protected].
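To make the "multiplicative gates and short convolutions" idea concrete, here is a toy block combining a short causal depthwise convolution with a multiplicative (sigmoid) gate. It is a conceptual sketch with assumed dimensions, not Liquid AI's implementation:

```python
# Conceptual "gate + short conv" block: a short causal depthwise convolution on
# one branch, multiplied by a learned sigmoid gate from the other branch.
# Dimensions are illustrative; this is not LFM2's actual architecture.
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    def __init__(self, dim: int = 512, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)                               # value branch + gate branch
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)           # short depthwise conv
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:                      # x: (batch, seq_len, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # trim right pad -> causal
        return self.out_proj(v * torch.sigmoid(g))                           # multiplicative gate

x = torch.randn(2, 16, 512)
print(GatedShortConv()(x).shape)  # torch.Size([2, 16, 512])
```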
jina-v2-fp16-mlx
parakeet-tdt-0.6b-v3-ane
Model Description parakeet-tdt-0.6b-v3 is a 600M-parameter multilingual automatic speech recognition (ASR) model from NVIDIA. It extends parakeet-tdt-0.6b-v2 by moving beyond English-only to support 25 European languages with automatic language detection. The model was primarily trained on the Granary multilingual corpus and is optimized for both research exploration and production deployment. This build is integrated with NexaSDK and optimized for modern NPUs, including Apple’s Neural Engine (ANE), for efficient on-device inference. Features - Multilingual ASR: 25 European languages with built-in language detection. - Text formatting: Outputs text with punctuation and capitalization. - Timestamps: Provides both word-level and segment-level timestamps. - Long audio transcription: Up to 24 minutes with full attention (A100 80GB). Up to 3 hours with local attention. - Optimized for NPUs: Runs efficiently on Apple ANE, Qualcomm Hexagon, and other dedicated accelerators. - Commercial-friendly: Released under CC-BY-4.0 license. The Apple Neural Engine (ANE) is a specialized NPU in Apple silicon designed to accelerate AI and ML workloads [3]. By offloading heavy ASR computations to the ANE, parakeet-tdt-0.6b-v3 achieves: - Lower-latency speech transcription on iPhone, iPad, and Mac. - Energy-efficient inference, extending battery life during real-time ASR tasks. - On-device privacy, keeping voice data local while maintaining production-grade accuracy. Supported Languages Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Ukrainian (uk) Use Cases - Conversational AI and multilingual chatbots - Voice assistants and smart devices - Real-time transcription services - Subtitles and caption generation - Voice analytics platforms - Research in speech technology Inputs and Outputs Input: - Type: 16kHz audio - Formats: `.wav`, `.mp3` - Shape: 1D mono audio Output: - Type: Text string - Properties: Punctuation + capitalization included The model may produce transcription errors, particularly with code-switching or noisy input. Evaluate thoroughly before deploying in sensitive domains (e.g., healthcare, finance, or legal). License Licensed under the original Parakeet license terms. See: Parakeet Model License
gemma-3n-E2B-it-4bit-MLX
Qwen3-0.6B-bf16-MLX
Qwen2.5-VL-7B-Instruct-4bit-MLX
SmolVLM-Instruct-8bit-MLX
paddleocr-npu
AutoNeural
Qwen3-0.6B-8bit-MLX
Squid
LFM2-24B-A2B-GGUF
convnext-tiny-npu-IoT
sdxl-turbo-amd-npu
HY-MT1.5-1.8B-npu
qwen3-0.6b-ane
octo-planner-2b
convnext-tiny-npu
1. Install nexaSDK and create a free account at sdk.nexa.ai 2. Activate your device with your access token: Model Description ConvNeXt-Tiny is a lightweight convolutional neural network (CNN) developed by Meta AI, designed to modernize traditional ConvNet architectures with design principles inspired by Vision Transformers (ViTs). With around 28 million parameters, it achieves competitive ImageNet performance while remaining efficient for on-device and edge inference. ConvNeXt-Tiny brings transformer-like accuracy to a purely convolutional design — combining modern architectural updates with the efficiency of classical CNNs. Features - High-accuracy Image Classification: Pretrained on ImageNet-1K with strong top-1 accuracy. - Flexible Backbone: Commonly used as a feature extractor for detection, segmentation, and multimodal systems. - Optimized for Efficiency: Compact model size enables fast inference and low latency on CPUs, GPUs, and NPUs. - Modernized CNN Design: Adopts ViT-inspired improvements such as layer normalization, larger kernels, and inverted bottlenecks. - Scalable Family: Part of the ConvNeXt suite (Tiny, Small, Base, Large, XLarge) for different compute and accuracy trade-offs. Use Cases - Real-time image recognition on edge or mobile devices - Vision backbone for multimodal and perception models - Visual search, tagging, and recommendation systems - Transfer learning and fine-tuning for domain-specific tasks - Efficient deployment in production or research environments Inputs and Outputs Input: - RGB image tensor (usually `3 × 224 × 224`) - Normalized using ImageNet mean and standard deviation Output: - 1000-dimensional logits for ImageNet class probabilities - Optional intermediate feature maps when used as a backbone License - All NPU-related components of this project — including code, models, runtimes, and configuration files under the src/npu/ and models/npu/ directories — are licensed under the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license. - Commercial licensing or usage rights must be obtained through a separate agreement. For inquiries regarding commercial use, please contact `[email protected]`
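The ConvNeXt-Tiny input/output contract above (a normalized 3 × 224 × 224 RGB tensor in, 1000 ImageNet logits out) can be exercised with the upstream torchvision weights. This is a reference sketch against torchvision, not the NPU build, and the image file name is illustrative:

```python
# Reference run with torchvision's ConvNeXt-Tiny: resize, center-crop to 224,
# normalize with ImageNet mean/std, and read the 1000-way logits.
import torch
from PIL import Image
from torchvision import models

weights = models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1
model = models.convnext_tiny(weights=weights).eval()
preprocess = weights.transforms()                      # resize, crop, ImageNet normalization

image = Image.open("example.jpg").convert("RGB")       # illustrative file name
batch = preprocess(image).unsqueeze(0)                 # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                              # shape: (1, 1000)

print(weights.meta["categories"][logits.argmax(dim=1).item()])
```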