# PowerInfer
## SmallThinker-21BA3B-Instruct-GGUF
- GGUF models with the `.gguf` suffix can be used with the llama.cpp framework.
- GGUF models with the `.powerinfer.gguf` suffix integrate fused sparse FFN operators and sparse LM head operators. These models are only compatible with the PowerInfer framework.

🤗 Hugging Face   |   🤖 ModelScope   |   📑 Technical Report

SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.

| Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
|------------------------------|-------|--------------|----------|--------|-----------|-----------|---------|
| SmallThinker-21BA3B-Instruct | 84.43 | 55.05 | 82.4 | 85.77 | 60.3 | 89.63 | 76.26 |
| Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
| Qwen3-14B | 84.82 | 50 | 84.6 | 85.21 | 59.5 | 88.41 | 75.42 |
| Qwen3-30BA3B | 85.1 | 44.4 | 84.4 | 84.29 | 58.8 | 90.24 | 74.54 |
| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
| Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |

For the MMLU evaluation, we use a 0-shot CoT setting.

### Speed

| Model | Memory (GiB) | i9 14900 | 1+13 8gen4 | rk3588 (16G) | Raspberry Pi 5 |
|--------------------------------------------|-------------------|----------|------------|--------------|----------------|
| SmallThinker 21B + sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B + sparse + limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B + limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |

Note: the i9 14900 and 1+13 8gen4 runs use 4 threads; the other devices use the number of threads that achieves their maximum speed. All models here have been quantized to Q4_0.

You can deploy SmallThinker with offloading support using PowerInfer.

| Architecture | Mixture-of-Experts (MoE) |
|:---:|:---:|
| Total Parameters | 21B |
| Activated Parameters | 3B |
| Number of Layers | 52 |
| Attention Hidden Dimension | 2560 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 28 |
| Number of KV Heads | 4 |
| Number of Experts | 64 |
| Selected Experts per Token | 6 |
| Vocabulary Size | 151,936 |
| Context Length | 16K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |

`transformers==4.53.3` is required; we are actively working to support the latest version. A sketch illustrating how to use the model to generate content from given inputs appears at the end of this section. `ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`; for basic usage, simply modify the first line of that code so the same classes are imported from `modelscope`.

### Statement

- Due to the constraints of its model size and the limitations of its training data, its responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.
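Below is a minimal generation sketch with Hugging Face `transformers`, written to match the usage note above. It is not the card's original snippet: the repository id, prompt, and generation settings are assumptions you should replace with your own.

```python
# Minimal sketch: load SmallThinker with transformers and generate a reply.
# The repository id below is an assumption; point it at the actual (non-GGUF) weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PowerInfer/SmallThinker-21BA3B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",       # let transformers pick a dtype suited to your hardware
    device_map="auto",
    trust_remote_code=True,   # the architecture may ship custom modeling code
)

# Format the conversation with the tokenizer's chat template, then generate.
messages = [{"role": "user", "content": "Give me a short introduction to Mixture-of-Experts models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For `ModelScope`, the same pattern should work with the imports swapped, e.g. `from modelscope import AutoModelForCausalLM, AutoTokenizer`; as noted above, the two APIs are similar but not identical.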
## SmallThinker-4BA0.6B-Instruct-GGUF
- GGUF models with the `.gguf` suffix can be used with the llama.cpp framework (a usage sketch appears at the end of this section).
- GGUF models with the `.powerinfer.gguf` suffix integrate fused sparse FFN operators and sparse LM head operators. These models are only compatible with the PowerInfer framework.

🤗 Hugging Face   |   🤖 ModelScope   |   📑 Technical Report

SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.

| Model | MMLU | GPQA-diamond | GSM8K | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SmallThinker-4BA0.6B-Instruct | 66.11 | 31.31 | 80.02 | 60.60 | 69.69 | 42.20 | 82.32 | 61.75 |
| Qwen3-0.6B | 43.31 | 26.77 | 62.85 | 45.6 | 58.41 | 23.1 | 31.71 | 41.67 |
| Qwen3-1.7B | 64.19 | 27.78 | 81.88 | 63.6 | 69.50 | 35.60 | 61.59 | 57.73 |
| Gemma3nE2b-it | 63.04 | 20.2 | 82.34 | 58.6 | 73.2 | 27.90 | 64.63 | 55.70 |
| Llama-3.2-3B-Instruct | 64.15 | 24.24 | 75.51 | 40 | 71.16 | 15.30 | 55.49 | 49.41 |
| Llama-3.2-1B-Instruct | 45.66 | 22.73 | 1.67 | 14.4 | 48.06 | 13.50 | 37.20 | 26.17 |

For the MMLU evaluation, we use a 0-shot CoT setting.

### Speed

| Model | Memory (GiB) | i9 14900 | 1+13 8gen4 | rk3588 (16G) | rk3576 | Raspberry Pi 5 | RDK X5 | rk3566 |
|----------------------------------------------------------------|-------------------|----------|------------|--------------|--------|----------------|--------|--------|
| SmallThinker 4B + sparse FFN + sparse LM head | 2.24 | 108.17 | 78.99 | 39.76 | 15.10 | 28.77 | 7.23 | 6.33 |
| SmallThinker 4B + sparse FFN + sparse LM head + limited memory | limit 1G | 29.99 | 20.91 | 15.04 | 2.60 | 0.75 | 0.67 | 0.74 |
| Qwen3 0.6B | 0.6 | 148.56 | 94.91 | 45.93 | 15.29 | 27.44 | 13.32 | 9.76 |
| Qwen3 1.7B | 1.3 | 62.24 | 41.00 | 20.29 | 6.09 | 11.08 | 6.35 | 4.15 |
| Qwen3 1.7B + limited memory | limit 1G | 2.66 | 1.09 | 1.00 | 0.47 | - | - | 0.11 |
| Gemma3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 3.80 | 6.66 | 3.46 | 2.45 |

Note: the i9 14900 and 1+13 8gen4 runs use 4 threads; the other devices use the number of threads that achieves their maximum speed. All models here have been quantized to Q4_0.

You can deploy SmallThinker with offloading support using PowerInfer.

| Architecture | Mixture-of-Experts (MoE) |
|:---:|:---:|
| Total Parameters | 4B |
| Activated Parameters | 0.6B |
| Number of Layers | 32 |
| Attention Hidden Dimension | 1536 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 12 |
| Number of Experts | 32 |
| Selected Experts per Token | 4 |
| Vocabulary Size | 151,936 |
| Context Length | 32K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |

`transformers==4.53.3` is required; we are actively working to support the latest version. Generation works the same way as for the 21B model above: the same `transformers` pattern applies with this model's repository id. `ModelScope` adopts a Python API similar to (though not entirely identical to) `Transformers`; for basic usage, simply modify the first line of that code so the same classes are imported from `modelscope`.

### Statement

- Due to the constraints of its model size and the limitations of its training data, its responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.
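For the plain `.gguf` files themselves, here is a minimal sketch using the `llama-cpp-python` bindings. This goes beyond what the card states (it only names the llama.cpp framework): the bindings, the local file name, and the parameters are assumptions, and your llama.cpp build must support this architecture. Per the notes above, `.powerinfer.gguf` files require PowerInfer and will not load this way.

```python
# Minimal sketch: run a plain .gguf SmallThinker file through the llama-cpp-python
# bindings. The file name is hypothetical; download the .gguf you want to use first.
# .powerinfer.gguf files will NOT load here; per the card, they need PowerInfer.
from llama_cpp import Llama

llm = Llama(
    model_path="SmallThinker-4BA0.6B-Instruct.Q4_0.gguf",  # hypothetical local file
    n_ctx=4096,      # context window to allocate
    n_threads=4,     # roughly matches the thread counts used in the speed tables
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a Mixture-of-Experts model is."}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

The same `.gguf` file can also be run with llama.cpp's own command-line tools; the Python bindings are used here only to keep the examples in one language.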