taobao-mnn
Qwen2.5-1.5B-Instruct-MNN
Qwen2.5-VL-3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen2.5-VL-3B-Instruct using llmexport.
Llama-3.2-1B-Instruct-MNN
Qwen3-VL-4B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-4B-Thinking using llmexport.
Qwen3-VL-4B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-4B-Instruct using llmexport.
Qwen2.5-Omni-3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen2.5-Omni-3B using llmexport.
Qwen3-4B-Instruct-2507-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-4B-Instruct-2507 using llmexport.
bert-vits2-MNN
gpt-oss-20b-MNN
Introduction: This model is a 4-bit quantized MNN model exported from gpt-oss-20b using llmexport.
Qwen3-4B-Thinking-2507-MNN
Qwen3.5-4B-MNN
Qwen3-VL-8B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-8B-Instruct using llmexport.
gemma-3-1b-it-qat-q4_0-gguf-MNN
DeepSeek-R1-0528-Qwen3-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from DeepSeek-R1-0528-Qwen3-8B using llmexport.
Qwen3.5-0.8B-MNN
Qwen3-VL-8B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-8B-Thinking using llmexport.
Qwen2.5-Omni-7B-MNN
Qwen3-0.6B-MNN
Qwen3.5-2B-MNN
Qwen3-1.7B-MNN
SmolVLM-256M-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM-256M-Instruct using llmexport.
sherpa-mnn-streaming-zipformer-en-2023-02-21
Qwen3-4B-MNN
Qwen2.5-7B-Instruct-MNN
Hunyuan-0.5B-Instruct-MNN
MiniCPM-V-4-MNN
SmolVLM2-500M-Video-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM2-500M-Video-Instruct using llmexport.
sherpa-mnn-streaming-zipformer-bilingual-zh-en-2023-02-20
gemma-3-4b-it-q4_0-mnn
Introduction: This model is a 4-bit quantized MNN model exported from gemma-3-4b-it-q4_0 using llmexport.
DeepSeek-R1-1.5B-Qwen-MNN
MiniCPM4-0.5B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiniCPM4-0.5B using llmexport.
Qwen3-Coder-30B-A3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Coder-30B-A3B-Instruct using llmexport.
Qwen3-VL-30B-A3B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-30B-A3B-Thinking using llmexport.
Qwen2.5-0.5B-Instruct-MNN
Hunyuan-1.8B-Instruct-MNN
Qwen3-VL-30B-A3B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-30B-A3B-Instruct using llmexport.
SmolLM3-3B-MNN
WebSailor-3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from WebSailor-3B using llmexport.
Qwen2.5-Coder-7B-Instruct-MNN
DeepSeek-R1-7B-Qwen-MNN
Qwen3-8B-MNN
Llama-3.2-3B-Instruct-MNN
Qwen2.5-Coder-1.5B-Instruct-MNN
SmolVLM2-256M-Video-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM2-256M-Video-Instruct using llmexport.
Hunyuan-4B-Instruct-MNN
deepseek-vl-7b-chat-MNN
Introduction: This model is a 4-bit quantized MNN model exported from deepseek-vl-7b-chat using llmexport.
Qwen3-30B-A3B-Thinking-2507-MNN
ERNIE-4.5-0.3B-PT-MNN
Introduction: This model is a 4-bit quantized MNN model exported from ERNIE-4.5-0.3B-PT using llmexport.
MiniCPM4-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiniCPM4-8B using llmexport.
gemma-2-2b-it-MNN
Qwen2-VL-2B-Instruct-MNN
Lingshu-7B-MNN
MobileLLM-125M-MNN
Qwen3-14B-MNN
SmolVLM2-256M-Video-Instruct-NPU
Introduction: This model is a 4-bit quantized MNN model exported from SmolVLM2-256M-Video-Instruct using llmexport.
Hunyuan-7B-Instruct-MNN
Qwen3-4B-SafeRL-MNN
Qwen3-VL-2B-Instruct-Eagle3
InternVL2_5-1B-MNN
Qwen3-Embedding-0.6B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-0.6B using llmexport.
Meta-Llama-3.1-8B-Instruct-MNN
Qwen3-VL-32B-Thinking-MNN
FastVLM-1.5B-Stage3-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-1.5B-Stage3 using llmexport.
Qwen2.5-3B-Instruct-MNN
TinyLlama-1.1B-Chat-MNN
SmolVLM-500M-Instruct-MNN
Introduction: This model is an 8-bit quantized MNN model exported from SmolVLM-500M-Instruct using llmexport.
phi-2-MNN
Qwen3Guard-Gen-0.6B-MNN
Qwen3Guard-Stream-8B-MNN
Qwen3-VL-32B-Instruct-MNN
MobileLLM-1B-MNN
SmolVLM2-2.2B-Instruct-MNN
FastVLM-0.5B-Stage3-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-0.5B-Stage3 using llmexport.
Meta-Llama-3.1-8B-Instruct-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Meta-Llama-3.1-8B-Instruct-Eagle3 using llmexport.
Qwen3-4B-Instruct-2507-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-4B-Instruct-2507` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-4B-Instruct-2507`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-4B-Instruct-2507`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 2.08 tokens (with 4 draft tokens), meaning it helps the base model advance more than 2 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) |
| :-------- | :-------------------------------: |
| gsm8k     | 2.22 |
| humaneval | 2.29 |
| math500   | 2.27 |
| cmmlu     | 1.94 |
| ceval     | 1.93 |
| mtbench   | 1.85 |
| Average   | ~2.08 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.

- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 3 epochs on 8x MI308X GPUs, taking 56 hours (448 MI308X GPU-hours).
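The draft-then-verify loop described above can be sketched end to end with toy stand-in models. Everything here (the fake models, the 0.7 agreement rate) is illustrative only, not part of EAGLE-3; the point is that acceptance length, not draft length, determines how far each step advances:

```python
import random

random.seed(0)  # deterministic toy run

def base_model(ctx):
    # Stand-in for the large base model: a deterministic next-token rule.
    return (sum(ctx) * 31 + 7) % 100

def draft_model(ctx):
    # Stand-in for the EAGLE draft head: agrees with the base model most
    # of the time, which is what makes drafting worth the extra work.
    return base_model(ctx) if random.random() < 0.7 else random.randrange(100)

def speculative_step(ctx, num_draft_tokens=4):
    """Draft k tokens, verify them against the base model, and return the
    accepted prefix plus one token from the base model (the correction),
    so every step advances by at least one token."""
    drafts, dctx = [], list(ctx)
    for _ in range(num_draft_tokens):
        t = draft_model(dctx)
        drafts.append(t)
        dctx.append(t)
    accepted, vctx = [], list(ctx)
    for t in drafts:  # in a real engine this check is one batched forward pass
        if t == base_model(vctx):
            accepted.append(t)
            vctx.append(t)
        else:
            break
    accepted.append(base_model(vctx))
    return accepted

ctx, total, steps = [1, 2, 3], 0, 200
for _ in range(steps):
    out = speculative_step(ctx)
    ctx.extend(out)
    total += len(out)
acc_length = total / steps
print(f"average acceptance length over {steps} steps: {acc_length:.2f}")
```

With a 0.7 agreement rate and 4 draft tokens, the measured `acc_length` lands well above 1, mirroring how the benchmark numbers in the table arise from per-step accept counts.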
MobileLLM-600M-MNN
Qwen3Guard-Gen-4B-MNN
Qwen3Guard-Stream-4B-MNN
Qwen3Guard-Stream-0.6B-MNN
Qwen3-30B-A3B-Instruct-2507-MNN
Hunyuan-MT-7B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Hunyuan-MT-7B using llmexport.
glm-4-9b-chat-MNN
SmolLM2-135M-Instruct-MNN
DeepSeek-Prover-V2-7B-MNN
Qwen3Guard-Gen-8B-MNN
FastVLM-1.5B-Stage2-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-1.5B-Stage2 using llmexport.
Qwen3-VL-2B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Instruct using llmexport.
Qwen3.5-27B-MNN
Qwen3-VL-2B-Thinking-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Thinking using llmexport.
gemma-3-270m-it-MNN
Meta-Llama-3-8B-Instruct-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Meta-Llama-3-8B-Instruct using llmexport.
deepseek-llm-7b-chat-MNN
Llama-2-7b-chat-ms-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Llama-2-7b-chat using llmexport.
Qwen3-Embedding-4B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-4B using llmexport.
SmolLM2-360M-Instruct-MNN
gemma-2-9b-it-MNN
gemma-7b-it-MNN
chatglm3-6b-MNN
MobileLLM-350M-MNN
TinyLlama-1.1B-Chat-v1.0-MNN
Qwen2-VL-7B-Instruct-MNN
Qwen2.5-Math-1.5B-Instruct-MNN
FastVLM-0.5B-Stage2-MNN
Introduction: This model is an 8-bit quantized MNN model exported from FastVLM-0.5B-Stage2 using llmexport.
Qwen3-Embedding-8B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Embedding-8B using llmexport.
Qwen-VL-Chat-MNN
Qwen2.5-VL-7B-Instruct-MNN
Qwen3-32B-MNN
Qwen3-30B-A3B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-30B-A3B using llmexport.
Qwen2-0.5B-Instruct-MNN
Qwen3-4B-Instruct-2507-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-4B-Instruct-2507-Eagle3 using llmexport.
Qwen3-VL-2B-Thinking-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-VL-2B-Thinking` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-VL-2B-Thinking`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-VL-2B-Thinking`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 1.71 tokens (with 4 draft tokens), meaning it helps the base model advance about 1.7 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :-------- | :-------------------------------: | :-------------------------------: |
| humaneval | 1.80 | 1.85 |
| gsm8k     | 1.77 | 1.80 |
| math500   | 1.75 | 1.81 |
| ceval     | 1.70 | 1.74 |
| cmmlu     | 1.65 | 1.70 |
| mtbench   | 1.61 | 1.65 |
| Average   | ~1.71 | ~1.76 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.
- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 2 epochs on 4x H20 GPUs, taking 27 hours (108 H20 GPU-hours).
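As a quick sanity check, the ~1.71 and ~1.76 averages in the table above are simply the unweighted means of the per-benchmark values:

```python
# acc_length per benchmark for this draft model, copied from the table
# above as (num_draft_tokens=4, num_draft_tokens=8) pairs.
acc_length = {
    "humaneval": (1.80, 1.85),
    "gsm8k":     (1.77, 1.80),
    "math500":   (1.75, 1.81),
    "ceval":     (1.70, 1.74),
    "cmmlu":     (1.65, 1.70),
    "mtbench":   (1.61, 1.65),
}

for i, k in enumerate((4, 8)):
    mean = sum(v[i] for v in acc_length.values()) / len(acc_length)
    print(f"num_draft_tokens={k}: mean acc_length = {mean:.2f}")
# → 1.71 for k=4 and 1.76 for k=8
```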
Qwen-7B-Chat-MNN
internlm-chat-7b-MNN
Baichuan2-7B-Chat-MNN
Qwen2-7B-Instruct-MNN
MiMo-7B-RL-MNN
Qwen3-Reranker-0.6B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Reranker-0.6B using llmexport.
SmolLM2-1.7B-Instruct-MNN
Qwen2-1.5B-Instruct-MNN
Qwen2.5-Math-7B-Instruct-MNN
Yi-6B-Chat-MNN
Qwen2-Audio-7B-Instruct-MNN
MiMo-7B-SFT-MNN
SmolVLM-Instruct-MNN
Qwen3-Reranker-4B-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-Reranker-4B using llmexport.
Qwen3-Reranker-8B-MNN
Qwen3-VL-2B-Instruct-Eagle3-MNN
Introduction: This model is a 4-bit quantized MNN model exported from Qwen3-VL-2B-Instruct-Eagle3 using llmexport.
MiMo-7B-Base-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiMo-7B-Base using llmexport.
Qwen3.5-35B-A3B-MNN
MiMo-7B-RL-Zero-MNN
Introduction: This model is a 4-bit quantized MNN model exported from MiMo-7B-RL-Zero using llmexport.
SmolDocling-256M-preview-MNN
reader-lm-0.5b-MNN
QwQ-32B-MNN
Qwen3.5-9B-MNN
OpenELM-3B-Instruct-MNN
Qwen1.5-4B-Chat-MNN
Qwen1.5-7B-Chat-MNN
bge-large-zh-MNN
Qwen3-VL-4B-Instruct-Eagle3
This repository contains an EAGLE-3 style draft model trained specifically to accelerate inference of the `Qwen3-VL-4B-Instruct` large language model.

This is not a standalone model. It must be used together with its base model (`Qwen3-VL-4B-Instruct`) inside a speculative decoding framework to achieve significant speedups in text generation.

- Base Model: `Qwen3-VL-4B-Instruct`
- Model Architecture: EAGLE-3 (speculative decoding draft model)
- Primary Benefit: accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is an advanced speculative decoding method. A small draft model proposes a sequence of draft tokens, which the larger, more powerful base model then verifies in a single forward pass. When drafts are accepted, generation advances multiple steps at once, yielding a substantial speedup.

This model serves as the "draft model" in that process. Its average acceptance length (`acc_length`) on standard benchmarks is approximately 1.81 tokens (with 4 draft tokens), meaning it helps the base model advance nearly 2 tokens per verification step on average.

The model was evaluated on a diverse set of benchmarks. `acc_length` (the average number of accepted draft tokens) measures the efficiency of the acceleration; higher is better.

| Benchmark | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :-------- | :-------------------------------: | :-------------------------------: |
| humaneval | 2.05 | 2.18 |
| math500   | 2.01 | 2.15 |
| ceval     | 1.74 | 1.80 |
| gsm8k     | 1.74 | 1.78 |
| cmmlu     | 1.72 | 1.77 |
| mtbench   | 1.61 | 1.66 |
| Average   | ~1.81 | ~1.89 |

These results demonstrate consistent, effective acceleration across tasks including coding, math, and general conversation.
- Training Framework: the model was trained with SpecForge, an open-source framework for speculative decoding research.
- Training Data: the EagleChat dataset, available on Hugging Face and ModelScope.
- Training Duration: 3 epochs on 8x MI308X GPUs, taking 56 hours (448 MI308X GPU-hours).
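How `acc_length` translates into wall-clock speedup depends on how cheap the draft head is relative to a base forward pass. A crude cost model makes that relationship explicit; the `draft_cost` ratio below is a hypothetical placeholder, not a measured value for this model, and real speedups (the 1.5x-2.5x claimed above) also depend on kernels, batching, and hardware:

```python
def estimated_speedup(acc_length, num_draft_tokens, draft_cost=0.02):
    """Crude decode-speedup model. Per speculative step we pay
    num_draft_tokens draft passes (each costing `draft_cost` of one base
    forward pass -- a hypothetical ratio) plus one base verification
    pass, and we advance acc_length tokens. Baseline autoregressive
    decoding pays one base pass per token."""
    step_cost = num_draft_tokens * draft_cost + 1.0
    return acc_length / step_cost

# Plugging in the table averages for this draft model:
print(f"{estimated_speedup(1.81, 4):.2f}x")  # k=4
print(f"{estimated_speedup(1.89, 8):.2f}x")  # k=8
```

Under this toy model, a larger `num_draft_tokens` only pays off if the extra drafts raise `acc_length` by more than they add to per-step cost, which is why the k=8 column's modest gains over k=4 matter.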