# PowerInfer
## SmallThinker-21BA3B-Instruct-GGUF
- GGUF models with the `.gguf` suffix can be used with the llama.cpp framework.
- GGUF models with the `.powerinfer.gguf` suffix integrate fused sparse FFN operators and sparse LM head operators. These models are only compatible with the PowerInfer framework.

🤗 Hugging Face   |   🤖 ModelScope   |   📑 Technical Report

SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.

| Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
|------------------------------|-------|--------------|----------|--------|-----------|-----------|---------|
| SmallThinker-21BA3B-Instruct | 84.43 | 55.05 | 82.4 | 85.77 | 60.3 | 89.63 | 76.26 |
| Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
| Qwen3-14B | 84.82 | 50 | 84.6 | 85.21 | 59.5 | 88.41 | 75.42 |
| Qwen3-30BA3B | 85.1 | 44.4 | 84.4 | 84.29 | 58.8 | 90.24 | 74.54 |
| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
| Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |

For the MMLU evaluation, we use a 0-shot CoT setting.

### Speed

| Model | Memory (GiB) | i9 14900 | 1+13 8gen4 | rk3588 (16G) | Raspberry Pi 5 |
|--------------------------------------------|-------------------|----------|------------|--------------|----------------|
| SmallThinker 21B + sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B + sparse + limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B + limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |

Note: the i9 14900 and 1+13 8gen4 runs use 4 threads; the other devices use the number of threads that achieves their maximum speed. All models here have been quantized to Q4_0.

You can deploy SmallThinker with offloading support using PowerInfer.

| Architecture | Mixture-of-Experts (MoE) |
|:---:|:---:|
| Total Parameters | 21B |
| Activated Parameters | 3B |
| Number of Layers | 52 |
| Attention Hidden Dimension | 2560 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 28 |
| Number of KV Heads | 4 |
| Number of Experts | 64 |
| Selected Experts per Token | 6 |
| Vocabulary Size | 151,936 |
| Context Length | 16K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |

`transformers==4.53.3` is required; we are actively working to support the latest version. A sketch illustrating how to use the model to generate content from given inputs appears at the end of this section. `ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`; for basic usage, simply modify the first line of that code so the same classes are imported from `modelscope`.

### Statement

- Due to the constraints of its model size and the limitations of its training data, its responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.
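Below is a minimal generation sketch with Hugging Face `transformers`, written to match the usage note above. It is not the card's original snippet: the repository id, prompt, and generation settings are assumptions you should replace with your own.

```python
# Minimal sketch: load SmallThinker with transformers and generate a reply.
# The repository id below is an assumption; point it at the actual (non-GGUF) weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PowerInfer/SmallThinker-21BA3B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",       # let transformers pick a dtype suited to your hardware
    device_map="auto",
    trust_remote_code=True,   # the architecture may ship custom modeling code
)

# Format the conversation with the tokenizer's chat template, then generate.
messages = [{"role": "user", "content": "Give me a short introduction to Mixture-of-Experts models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For `ModelScope`, the same pattern should work with the imports swapped, e.g. `from modelscope import AutoModelForCausalLM, AutoTokenizer`; as noted above, the two APIs are similar but not identical.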
## SmallThinker-4BA0.6B-Instruct-GGUF
- GGUF models with the `.gguf` suffix can be used with the llama.cpp framework (a usage sketch appears at the end of this section).
- GGUF models with the `.powerinfer.gguf` suffix integrate fused sparse FFN operators and sparse LM head operators. These models are only compatible with the PowerInfer framework.

🤗 Hugging Face   |   🤖 ModelScope   |   📑 Technical Report

SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.

| Model | MMLU | GPQA-diamond | GSM8K | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SmallThinker-4BA0.6B-Instruct | 66.11 | 31.31 | 80.02 | 60.60 | 69.69 | 42.20 | 82.32 | 61.75 |
| Qwen3-0.6B | 43.31 | 26.77 | 62.85 | 45.6 | 58.41 | 23.1 | 31.71 | 41.67 |
| Qwen3-1.7B | 64.19 | 27.78 | 81.88 | 63.6 | 69.50 | 35.60 | 61.59 | 57.73 |
| Gemma3nE2b-it | 63.04 | 20.2 | 82.34 | 58.6 | 73.2 | 27.90 | 64.63 | 55.70 |
| Llama-3.2-3B-Instruct | 64.15 | 24.24 | 75.51 | 40 | 71.16 | 15.30 | 55.49 | 49.41 |
| Llama-3.2-1B-Instruct | 45.66 | 22.73 | 1.67 | 14.4 | 48.06 | 13.50 | 37.20 | 26.17 |

For the MMLU evaluation, we use a 0-shot CoT setting.

### Speed

| Model | Memory (GiB) | i9 14900 | 1+13 8gen4 | rk3588 (16G) | rk3576 | Raspberry Pi 5 | RDK X5 | rk3566 |
|----------------------------------------------------------------|-------------------|----------|------------|--------------|--------|----------------|--------|--------|
| SmallThinker 4B + sparse FFN + sparse LM head | 2.24 | 108.17 | 78.99 | 39.76 | 15.10 | 28.77 | 7.23 | 6.33 |
| SmallThinker 4B + sparse FFN + sparse LM head + limited memory | limit 1G | 29.99 | 20.91 | 15.04 | 2.60 | 0.75 | 0.67 | 0.74 |
| Qwen3 0.6B | 0.6 | 148.56 | 94.91 | 45.93 | 15.29 | 27.44 | 13.32 | 9.76 |
| Qwen3 1.7B | 1.3 | 62.24 | 41.00 | 20.29 | 6.09 | 11.08 | 6.35 | 4.15 |
| Qwen3 1.7B + limited memory | limit 1G | 2.66 | 1.09 | 1.00 | 0.47 | - | - | 0.11 |
| Gemma3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 3.80 | 6.66 | 3.46 | 2.45 |

Note: the i9 14900 and 1+13 8gen4 runs use 4 threads; the other devices use the number of threads that achieves their maximum speed. All models here have been quantized to Q4_0.

You can deploy SmallThinker with offloading support using PowerInfer.

| Architecture | Mixture-of-Experts (MoE) |
|:---:|:---:|
| Total Parameters | 4B |
| Activated Parameters | 0.6B |
| Number of Layers | 32 |
| Attention Hidden Dimension | 1536 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 12 |
| Number of Experts | 32 |
| Selected Experts per Token | 4 |
| Vocabulary Size | 151,936 |
| Context Length | 32K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |

`transformers==4.53.3` is required; we are actively working to support the latest version. Generation works the same way as for the 21B model above: the same `transformers` pattern applies with this model's repository id. `ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`; for basic usage, simply modify the first line of that code so the same classes are imported from `modelscope`.

### Statement

- Due to the constraints of its model size and the limitations of its training data, its responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.
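For the plain `.gguf` files themselves, here is a minimal sketch using the `llama-cpp-python` bindings. This goes beyond what the card states (it only names the llama.cpp framework): the bindings, the local file name, and the parameters are assumptions, and your llama.cpp build must support this architecture. Per the notes above, `.powerinfer.gguf` files require PowerInfer and will not load this way.

```python
# Minimal sketch: run a plain .gguf SmallThinker file through the llama-cpp-python
# bindings. The file name is hypothetical; download the .gguf you want to use first.
# .powerinfer.gguf files will NOT load here; per the card, they need PowerInfer.
from llama_cpp import Llama

llm = Llama(
    model_path="SmallThinker-4BA0.6B-Instruct.Q4_0.gguf",  # hypothetical local file
    n_ctx=4096,      # context window to allocate
    n_threads=4,     # roughly matches the thread counts used in the speed tables
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a Mixture-of-Experts model is."}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

The same `.gguf` file can also be run with llama.cpp's own command-line tools; the Python bindings are used here only to keep the examples in one language.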