taobao-mnn
Qwen2.5-1.5B-Instruct-MNN
Qwen2.5-VL-3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen2.5-VL-3B-Instruct using llmexport.
Llama-3.2-1B-Instruct-MNN
Qwen3-VL-4B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-4B-Thinking using llmexport.
Qwen3-VL-4B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-4B-Instruct using llmexport.
Qwen2.5-Omni-3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen2.5-Omni-3B using llmexport.
Qwen3-4B-Instruct-2507-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-4B-Instruct-2507 using llmexport.
bert-vits2-MNN
gpt-oss-20b-MNN
Introduction: This model is a 4-bit quantized MNN model exported from gpt-oss-20b using llmexport.
Qwen3-4B-Thinking-2507-MNN
Qwen3.5-4B-MNN
Qwen3-VL-8B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-8B-Instruct using llmexport.
gemma-3-1b-it-qat-q4_0-gguf-MNN
DeepSeek-R1-0528-Qwen3-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from DeepSeek-R1-0528-Qwen3-8B using llmexport.
Qwen3.5-0.8B-MNN
Qwen3-VL-8B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-8B-Thinking using llmexport.
Qwen2.5-Omni-7B-MNN
Qwen3-0.6B-MNN
Qwen3.5-2B-MNN
Qwen3-1.7B-MNN
SmolVLM-256M-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM-256M-Instruct using llmexport.
sherpa-mnn-streaming-zipformer-en-2023-02-21
Qwen3-4B-MNN
Qwen2.5-7B-Instruct-MNN
Hunyuan-0.5B-Instruct-MNN
MiniCPM-V-4-MNN
SmolVLM2-500M-Video-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM2-500M-Video-Instruct using llmexport.
sherpa-mnn-streaming-zipformer-bilingual-zh-en-2023-02-20
gemma-3-4b-it-q4_0-mnn
Introduction: This model is a 4-bit quantized MNN model exported from gemma-3-4b-it-q4_0 using llmexport.
DeepSeek-R1-1.5B-Qwen-MNN
MiniCPM4-0.5B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiniCPM4-0.5B using llmexport.
Qwen3-Coder-30B-A3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Coder-30B-A3B-Instruct using llmexport.
Qwen3-VL-30B-A3B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-30B-A3B-Thinking using llmexport.
Qwen2.5-0.5B-Instruct-MNN
Hunyuan-1.8B-Instruct-MNN
Qwen3-VL-30B-A3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-30B-A3B-Instruct using llmexport.
SmolLM3-3B-MNN
WebSailor-3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from WebSailor-3B using llmexport.
Qwen2.5-Coder-7B-Instruct-MNN
DeepSeek-R1-7B-Qwen-MNN
Qwen3-8B-MNN
Llama-3.2-3B-Instruct-MNN
Qwen2.5-Coder-1.5B-Instruct-MNN
SmolVLM2-256M-Video-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM2-256M-Video-Instruct using llmexport.
Hunyuan-4B-Instruct-MNN
deepseek-vl-7b-chat-MNN
Introduction: This model is a 4-bit quantized MNN model exported from deepseek-vl-7b-chat using llmexport.
Qwen3-30B-A3B-Thinking-2507-MNN
ERNIE-4.5-0.3B-PT-MNN
Introduction: This model is a 4-bit quantized MNN model exported from ERNIE-4.5-0.3B-PT using llmexport.
MiniCPM4-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiniCPM4-8B using llmexport.
gemma-2-2b-it-MNN
Qwen2-VL-2B-Instruct-MNN
Lingshu-7B-MNN
MobileLLM-125M-MNN
Qwen3-14B-MNN
SmolVLM2-256M-Video-Instruct-NPU
Introduction: This model is a 4-bit quantized MNN model exported from SmolVLM2-256M-Video-Instruct using llmexport.
Hunyuan-7B-Instruct-MNN
Qwen3-4B-SafeRL-MNN
Qwen3-VL-2B-Instruct-Eagle3
InternVL2_5-1B-MNN
Qwen3-Embedding-0.6B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-0.6B using llmexport.
Meta-Llama-3.1-8B-Instruct-MNN
Qwen3-VL-32B-Thinking-MNN
FastVLM-1.5B-Stage3-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-1.5B-Stage3 using llmexport.
Qwen2.5-3B-Instruct-MNN
TinyLlama-1.1B-Chat-MNN
SmolVLM-500M-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM-500M-Instruct using llmexport.
phi-2-MNN
Qwen3Guard-Gen-0.6B-MNN
Qwen3Guard-Stream-8B-MNN
Qwen3-VL-32B-Instruct-MNN
MobileLLM-1B-MNN
SmolVLM2-2.2B-Instruct-MNN
FastVLM-0.5B-Stage3-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-0.5B-Stage3 using llmexport.
Meta-Llama-3.1-8B-Instruct-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Meta-Llama-3.1-8B-Instruct-Eagle3 using llmexport.
Qwen3-4B-Instruct-2507-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-4B-Instruct-2507` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-4B-Instruct-2507`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-4B-Instruct-2507`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 2.08 tokens (with 4 draft tokens), meaning it helps the base model advance more than 2 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) |
| :-------- | :-------------------------------: |
| gsm8k     | 2.22 |
| humaneval | 2.29 |
| math500   | 2.27 |
| cmmlu     | 1.94 |
| ceval     | 1.93 |
| mtbench   | 1.85 |
| Average   | ~2.08 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.

- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 3 epochs on 8x MI308X GPUs, taking 56 hours (448 MI308X GPU-hours).
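The draft-then-verify loop described above can be sketched end to end with toy stand-in models. Everything here (the fake models, the 0.7 agreement rate) is illustrative only, not part of EAGLE-3; the point is that acceptance length, not draft length, determines how far each step advances:

```python
import random

random.seed(0)  # deterministic toy run

def base_model(ctx):
    # Stand-in for the large base model: a deterministic next-token rule.
    return (sum(ctx) * 31 + 7) % 100

def draft_model(ctx):
    # Stand-in for the EAGLE draft head: agrees with the base model most
    # of the time, which is what makes drafting worth the extra work.
    return base_model(ctx) if random.random() < 0.7 else random.randrange(100)

def speculative_step(ctx, num_draft_tokens=4):
    """Draft k tokens, verify them against the base model, and return the
    accepted prefix plus one token from the base model (the correction),
    so every step advances by at least one token."""
    drafts, dctx = [], list(ctx)
    for _ in range(num_draft_tokens):
        t = draft_model(dctx)
        drafts.append(t)
        dctx.append(t)
    accepted, vctx = [], list(ctx)
    for t in drafts:  # in a real engine this check is one batched forward pass
        if t == base_model(vctx):
            accepted.append(t)
            vctx.append(t)
        else:
            break
    accepted.append(base_model(vctx))
    return accepted

ctx, total, steps = [1, 2, 3], 0, 200
for _ in range(steps):
    out = speculative_step(ctx)
    ctx.extend(out)
    total += len(out)
acc_length = total / steps
print(f"average acceptance length over {steps} steps: {acc_length:.2f}")
```

With a 0.7 agreement rate and 4 draft tokens, the measured `acc_length` lands well above 1, mirroring how the benchmark numbers in the table arise from per-step accept counts.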
MobileLLM-600M-MNN
Qwen3Guard-Gen-4B-MNN
Qwen3Guard-Stream-4B-MNN
Qwen3Guard-Stream-0.6B-MNN
Qwen3-30B-A3B-Instruct-2507-MNN
Hunyuan-MT-7B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Hunyuan-MT-7B using llmexport.
glm-4-9b-chat-MNN
SmolLM2-135M-Instruct-MNN
DeepSeek-Prover-V2-7B-MNN
Qwen3Guard-Gen-8B-MNN
FastVLM-1.5B-Stage2-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-1.5B-Stage2 using llmexport.
Qwen3-VL-2B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Instruct using llmexport.
Qwen3.5-27B-MNN
Qwen3-VL-2B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Thinking using llmexport.
gemma-3-270m-it-MNN
Meta-Llama-3-8B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Meta-Llama-3-8B-Instruct using llmexport.
deepseek-llm-7b-chat-MNN
Llama-2-7b-chat-ms-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Llama-2-7b-chat using llmexport.
Qwen3-Embedding-4B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-4B using llmexport.
SmolLM2-360M-Instruct-MNN
gemma-2-9b-it-MNN
gemma-7b-it-MNN
chatglm3-6b-MNN
MobileLLM-350M-MNN
TinyLlama-1.1B-Chat-v1.0-MNN
Qwen2-VL-7B-Instruct-MNN
Qwen2.5-Math-1.5B-Instruct-MNN
FastVLM-0.5B-Stage2-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-0.5B-Stage2 using llmexport.
Qwen3-Embedding-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-8B using llmexport.
Qwen-VL-Chat-MNN
Qwen2.5-VL-7B-Instruct-MNN
Qwen3-32B-MNN
Qwen3-30B-A3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-30B-A3B using llmexport.
Qwen2-0.5B-Instruct-MNN
Qwen3-4B-Instruct-2507-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-4B-Instruct-2507-Eagle3 using llmexport.
Qwen3-VL-2B-Thinking-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-VL-2B-Thinking` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-VL-2B-Thinking`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-VL-2B-Thinking`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 1.71 tokens (with 4 draft tokens), meaning it helps the base model advance about 1.7 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :-------- | :-------------------------------: | :-------------------------------: |
| humaneval | 1.80 | 1.85 |
| gsm8k     | 1.77 | 1.80 |
| math500   | 1.75 | 1.81 |
| ceval     | 1.70 | 1.74 |
| cmmlu     | 1.65 | 1.70 |
| mtbench   | 1.61 | 1.65 |
| Average   | ~1.71 | ~1.76 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.
- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 2 epochs on 4x H20 GPUs, taking 27 hours (108 H20 GPU-hours).
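As a quick sanity check, the ~1.71 and ~1.76 averages in the table above are simply the unweighted means of the per-benchmark values:

```python
# acc_length per benchmark for this draft model, copied from the table
# above as (num_draft_tokens=4, num_draft_tokens=8) pairs.
acc_length = {
    "humaneval": (1.80, 1.85),
    "gsm8k":     (1.77, 1.80),
    "math500":   (1.75, 1.81),
    "ceval":     (1.70, 1.74),
    "cmmlu":     (1.65, 1.70),
    "mtbench":   (1.61, 1.65),
}

for i, k in enumerate((4, 8)):
    mean = sum(v[i] for v in acc_length.values()) / len(acc_length)
    print(f"num_draft_tokens={k}: mean acc_length = {mean:.2f}")
# → 1.71 for k=4 and 1.76 for k=8
```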
Qwen-7B-Chat-MNN
internlm-chat-7b-MNN
Baichuan2-7B-Chat-MNN
Qwen2-7B-Instruct-MNN
MiMo-7B-RL-MNN
Qwen3-Reranker-0.6B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Reranker-0.6B using llmexport.
SmolLM2-1.7B-Instruct-MNN
Qwen2-1.5B-Instruct-MNN
Qwen2.5-Math-7B-Instruct-MNN
Yi-6B-Chat-MNN
Qwen2-Audio-7B-Instruct-MNN
MiMo-7B-SFT-MNN
SmolVLM-Instruct-MNN
Qwen3-Reranker-4B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Reranker-4B using llmexport.
Qwen3-Reranker-8B-MNN
Qwen3-VL-2B-Instruct-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Instruct-Eagle3 using llmexport.
MiMo-7B-Base-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiMo-7B-Base using llmexport.
Qwen3.5-35B-A3B-MNN
MiMo-7B-RL-Zero-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiMo-7B-RL-Zero using llmexport.
SmolDocling-256M-preview-MNN
reader-lm-0.5b-MNN
QwQ-32B-MNN
Qwen3.5-9B-MNN
OpenELM-3B-Instruct-MNN
Qwen1.5-4B-Chat-MNN
Qwen1.5-7B-Chat-MNN
bge-large-zh-MNN
Qwen3-VL-4B-Instruct-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-VL-4B-Instruct` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-VL-4B-Instruct`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-VL-4B-Instruct`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 1.81 tokens (with 4 draft tokens), meaning it helps the base model advance nearly 2 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :-------- | :-------------------------------: | :-------------------------------: |
| humaneval | 2.05 | 2.18 |
| math500   | 2.01 | 2.15 |
| ceval     | 1.74 | 1.80 |
| gsm8k     | 1.74 | 1.78 |
| cmmlu     | 1.72 | 1.77 |
| mtbench   | 1.61 | 1.66 |
| Average   | ~1.81 | ~1.89 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.
- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 3 epochs on 8x MI308X GPUs, taking 56 hours (448 MI308X GPU-hours).
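How `acc_length` translates into wall-clock speedup depends on how cheap the draft head is relative to a base forward pass. A crude cost model makes that relationship explicit; the `draft_cost` ratio below is a hypothetical placeholder, not a measured value for this model, and real speedups (the 1.5x-2.5x claimed above) also depend on kernels, batching, and hardware:

```python
def estimated_speedup(acc_length, num_draft_tokens, draft_cost=0.02):
    """Crude decode-speedup model. Per speculative step we pay
    num_draft_tokens draft passes (each costing `draft_cost` of one base
    forward pass -- a hypothetical ratio) plus one base verification
    pass, and we advance acc_length tokens. Baseline autoregressive
    decoding pays one base pass per token."""
    step_cost = num_draft_tokens * draft_cost + 1.0
    return acc_length / step_cost

# Plugging in the table averages for this draft model:
print(f"{estimated_speedup(1.81, 4):.2f}x")  # k=4
print(f"{estimated_speedup(1.89, 8):.2f}x")  # k=8
```

Under this toy model, a larger `num_draft_tokens` only pays off if the extra drafts raise `acc_length` by more than they add to per-step cost, which is why the k=8 column's modest gains over k=4 matter.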