llama-moe

8 models

LLaMA-MoE-v1-3_5B-4_16

llama_moe

LLaMA-MoE-v1-3_5B-2_8

llama_moe

LLaMA-MoE-v1-3_0B-2_16

[[💻 Code]](https://github.com/pjlab-sys4nlp/llama-moe) | [[📜 Technical Report]](https://github.com/pjlab-sys4nlp/llama-moe/blob/main/docs/LLaMA_MoE.pdf)

❤️ This repo contains the model `LLaMA-MoE-v1-3.0B (2/16)`, which activates 2 out of 16 experts (3.0B activated parameters; see the routing sketch further below). The model is NOT fine-tuned on instruction pairs, so it may not perform well as a chatbot.

📢 LLaMA-MoE is a series of Mixture-of-Experts (MoE) models based on LLaMA-2. You can find the training code in the repo linked above.

💎 These models are obtained by partitioning the original LLaMA FFNs into experts and then continually pre-training the result. The total model size is only 6.7B parameters, which makes it very convenient for deployment and research use. More details can be found in the technical report.

| Model                 | #Activated Experts | #Experts | #Activated Params | Links                                                                      |
| :-------------------- | :----------------: | :------: | :---------------: | :------------------------------------------------------------------------: |
| LLaMA-MoE-3.0B        |          2         |    16    |        3.0B       | [[🤗 HF Weights]](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_0B-2_16) |
| LLaMA-MoE-3.5B (4/16) |          4         |    16    |        3.5B       | [[🤗 HF Weights]](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-4_16) |
| LLaMA-MoE-3.5B (2/8)  |          2         |     8    |        3.5B       | [[🤗 HF Weights]](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8)  |

Benchmark results (a number in a header's parentheses is the few-shot count; the remaining tasks are zero-shot):

| Model                 | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) | Average |
| :-------------------- | :--: | :--: | :--------: | :---: | :--------: | :------------: | :----: | :--------: | :-----: | :-----: | :------: | :-----: |
| OPT-2.7B              | 78.9 | 74.8 |    60.8    |  54.4 |    34.0    |      61.4      |  25.8  |    63.3    |   63.6  |   10.7  |   25.8   |   50.3  |
| Pythia-2.8B           | 83.2 | 73.6 |    59.6    |  58.8 |    36.7    |      60.7      |  28.1  |    65.9    |   64.6  |   8.7   |   26.8   |   51.5  |
| INCITE-BASE-3B        | 85.6 | 73.9 |    63.5    |  61.7 |    40.3    |      64.7      |  27.5  |    65.8    |   65.4  |   15.2  |   27.2   |   53.7  |
| Open-LLaMA-3B-v2      | 88.0 | 77.9 |    63.1    |  63.3 |    40.1    |      71.4      |  28.1  |    69.2    |   67.4  |   16.0  |   26.8   |   55.6  |
| Sheared-LLaMA-2.7B    | 87.5 | 76.9 |    65.0    |  63.3 |    41.6    |      71.0      |  28.3  |    73.6    |   68.3  |   17.6  |   27.3   |   56.4  |
| LLaMA-MoE-3.0B        | 84.2 | 77.5 |    63.6    |  60.2 |    40.9    |      70.8      |  30.6  |    71.9    |   66.6  |   17.0  |   26.8   |   55.5  |
| LLaMA-MoE-3.5B (4/16) | 87.6 | 77.9 |    65.5    |  65.6 |    44.2    |      73.3      |  29.7  |    75.0    |   69.5  |   20.3  |   26.8   |   57.7  |
| LLaMA-MoE-3.5B (2/8)  | 88.4 | 77.6 |    66.7    |  65.3 |    43.1    |      73.3      |  29.6  |    73.9    |   69.4  |   19.8  |   27.0   |   57.6  |

Training data: 200B tokens from SlimPajama with the same data sampling weights as Sheared-LLaMA.
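For quick experimentation, the checkpoint can be loaded through Hugging Face Transformers. The snippet below is a minimal sketch, assuming the HF repo bundles its custom LLaMA-MoE modeling code (hence `trust_remote_code=True`) and that a CUDA device is available; the prompt string is purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the checkpoint ships custom LLaMA-MoE modeling code,
# so trust_remote_code=True is needed to instantiate the model class.
model_dir = "llama-moe/LLaMA-MoE-v1-3_0B-2_16"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()
model.to("cuda:0")

# The base model is not instruction-tuned, so plain text continuation
# (rather than chat-style prompting) is the right way to probe it.
inputs = tokenizer("Suzhou is famous for", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```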

llama_moe
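As a supplement to the card above: "activating 2 out of 16 experts" is standard top-k gating, where a learned router scores every expert per token and only the two best-scoring expert FFNs are evaluated. The sketch below is a toy PyTorch illustration of that technique, not the code from this repo; the two-layer expert MLP, the linear gate, and the softmax weighting are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop2MoEFFN(nn.Module):
    """Toy mixture-of-experts FFN: 16 expert MLPs, top-2 routing per token."""

    def __init__(self, hidden_size: int, expert_size: int,
                 num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one score per expert for every token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_size),
                nn.SiLU(),
                nn.Linear(expert_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)        # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep the 2 best experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = ToyTop2MoEFFN(hidden_size=64, expert_size=128)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Because only 2 of the 16 expert FFNs run for any given token, the activated parameter count stays near 3.0B even though the full model holds 6.7B parameters.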

LLaMA-MoE-v2-3_8B-2_8-sft

license:apache-2.0

LLaMA-MoE-v1-3_0B-2_16-sft

llama_moe

LLaMA-MoE-v2-3_8B-residual-sft

license:apache-2.0

LLaMA-MoE-v1-3_5B-2_8-sft

llama_moe

LLaMA-MoE-v1-3_5B-4_16-sft

llama_moe