llm-jp


llm-jp-3-1.8b


llm-jp-3.1-1.8b-instruct4

LLM-jp-3.1 is a series of large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics. Building upon the LLM-jp-3 series, the LLM-jp-3.1 models incorporate mid-training (instruction pre-training), which significantly enhances their instruction-following capabilities compared to the original LLM-jp-3 models. This repository provides the llm-jp-3.1-1.8b-instruct4 model. For an overview of the LLM-jp-3.1 models across different parameter sizes, please refer to:

- LLM-jp-3.1 Pre-trained Models
- LLM-jp-3.1 Fine-tuned Models

For more details on the training procedures and evaluation results, please refer to this blog post (in Japanese).

Required libraries:

- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8

Model type: Transformer-based Language Model

Dense models:

|Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1.8b|24|2048|16|4096|407,498,752|1,459,718,144|
|13b|40|5120|40|4096|1,018,746,880|12,688,184,320|

MoE model:

|Params|Layers|Hidden size|Heads|Routed Experts|Activated Experts|Context length|Embedding parameters|Non-embedding parameters|Activated parameters|Total parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|8x13b|40|5120|40|8|2|4096|1,018,746,880|72,144,081,920|22,200,806,400|73,162,828,800|

The tokenizer of this model is based on the huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from `llm-jp-tokenizer v3.0`. Please refer to the README.md of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce this vocabulary). The models have been pre-trained using a blend of the following datasets.
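A few figures in the architecture tables can be cross-checked arithmetically. The sketch below is not from the original card: it assumes untied input and output embedding matrices and a 99,487-entry vocabulary (an inferred figure, not stated above), which reproduces the embedding parameter counts exactly.

```python
# Sanity-checking the parameter columns in the architecture tables.
# Assumption: untied input/output embeddings over a 99,487-entry vocabulary.

VOCAB_SIZE = 99_487  # inferred from the table, not stated in the card

def embedding_params(hidden_size, vocab_size=VOCAB_SIZE):
    # input embedding matrix + output (LM head) projection
    return 2 * vocab_size * hidden_size

assert embedding_params(2048) == 407_498_752    # 1.8b row
assert embedding_params(5120) == 1_018_746_880  # 13b and 8x13b rows

# For the 8x13b MoE row, total parameters = embedding + non-embedding:
assert 1_018_746_880 + 72_144_081_920 == 73_162_828_800
```

The same identity does not hold for the older llm-jp-3 cards below, whose embedding counts differ slightly (likely a different number of special tokens).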
| Language | Dataset | Tokens |
|:---|:---|---:|
| Japanese | Wikipedia | 2.6B |
| | Common Crawl | 762.8B |
| | WARP/PDF | 237.3B |
| | WARP/HTML | 2.7B |
| | Kaken | 1.8B |
| English | Wikipedia | 4.7B |
| | Dolma/CC-head | 608.5B |
| | Dolma/C4 | 181.6B |
| | Dolma/Reddit | 83.1B |
| | Dolma/PeS2o | 62.9B |
| | Dolma/Gutenberg | 5.5B |
| | Dolma/Wiki | 3.9B |
| Code | The Stack | 114.1B |
| Chinese | Wikipedia | 0.8B |
| Korean | Wikipedia | 0.3B |

In the LLM-jp-3.1 series, we performed continuous pre-training based on Instruction Pre-Training, which enhances a model's ability to follow instructions by continuing pre-training on a large collection of instruction–response pairs. We prepared approximately 90B tokens of instruction–response data, mixed it with our pre-training datasets, and conducted continuous pre-training on a total of 400B tokens. Each model was initialized from an existing checkpoint (llm-jp/llm-jp-3-1.8b, llm-jp/llm-jp-3-13b, or llm-jp/llm-jp-3-8x13b) and underwent continuous instruction pre-training. Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens. Details of this training process will be released in a forthcoming paper; the instruction–response dataset used for this training will also be made publicly available.

We fine-tuned the pre-trained checkpoint with supervised fine-tuning (SFT) and further aligned it with Direct Preference Optimization (DPO). The datasets used for supervised fine-tuning are as follows:

| Language | Dataset | Description |
|:---|:---|:---|
| Japanese | ichikara-instruction-004-002 | A manually constructed instruction dataset. |
| | AnswerCarefully (ver2.0) | A manually constructed instruction dataset focusing on LLMs' safety. |
| | ichikara-instruction-format | A small subset of the ichikara-instruction dataset, edited with some constraints on the output format. |
| | AutoMultiTurnByCalm3-22B | A synthetic instruction dataset. |
| | ramdom-to-fixed-multiturn-Calm3 | A synthetic instruction dataset. |
| | wizardlm8x22b-logical-math-coding-sft-ja | A synthetic instruction dataset. |
| | magpie-sft-v1.0 | A synthetic instruction dataset we created. |
| | jaster v1.4.1 | - |
| | extraction-wiki-ja | A synthetic instruction dataset we created. |
| English | Daring-Anteater | - |
| Japanese & English | Synthetic-JP-EN-Coding-Dataset | A synthetic instruction dataset. |

For DPO, we adopted rejection sampling: prompts were sampled from the dataset used in SFT, multiple responses were generated for each prompt, the responses were scored by Qwen/Qwen2.5-32B-Instruct, and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples. We conducted DPO in two stages; in the second stage, we additionally used ac-self-inst, a Japanese preference dataset focused on safety.

We evaluated the models using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation. For more details, please refer to the codes.
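The rejection-sampling recipe described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: `generate_fn` and `score_fn` are hypothetical stand-ins for the policy model's sampler and the Qwen/Qwen2.5-32B-Instruct judge.

```python
# Sketch of DPO pair construction via rejection sampling: for each SFT
# prompt, sample several responses, score them with a judge, and keep the
# highest-scoring response as "chosen" and the lowest as "rejected".

def build_dpo_pairs(prompts, generate_fn, score_fn, n_samples=4):
    """Return {"prompt", "chosen", "rejected"} records for DPO training."""
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt) for _ in range(n_samples)]
        scored = sorted(candidates, key=score_fn)
        rejected, chosen = scored[0], scored[-1]
        if score_fn(chosen) > score_fn(rejected):  # skip all-tied prompts
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

In practice such pairs would then be fed to a DPO trainer (e.g. over two stages, as the card describes), but the pair-construction step is the part specific to rejection sampling.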
| Model Name | JA | EN |
|:---|---:|---:|
| gpt-35-turbo-1106 | 6.48 | 7.56 |
| gpt-4-0613 | 7.29 | 7.72 |
| gpt-4o-2024-08-06 | 8.10 | 8.38 |
| sbintuitions/sarashina2.2-1b-instruct-v0.1 | 5.30 | 5.66 |
| sbintuitions/sarashina2.2-3b-instruct-v0.1 | 7.07 | 6.96 |
| Rakuten/RakutenAI-2.0-8x7B-instruct | 6.68 | 6.33 |
| cyberagent/calm3-22b-chat | 6.86 | 6.77 |
| Qwen/Qwen2.5-14B-Instruct | 7.07 | 7.99 |
| Qwen/Qwen2.5-32B-Instruct | 7.64 | 8.27 |
| Qwen/Qwen3-1.7B | 5.46 | 6.95 |
| Qwen/Qwen3-14B | 8.00 | 8.30 |
| Qwen/Qwen3-32B | 8.36 | 8.33 |
| tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4 | 7.64 | 8.02 |
| stockmark/Stockmark-2-100B-Instruct-beta | 7.42 | 7.17 |
| llm-jp-3-1.8b-instruct3 | 4.64 | 4.09 |
| llm-jp-3-13b-instruct3 | 6.21 | 6.13 |
| llm-jp-3-8x13b-instruct3 | 6.60 | 6.49 |
| llm-jp-3.1-1.8b-instruct4 | 6.30 | 5.70 |
| llm-jp-3.1-13b-instruct4 | 7.37 | 7.01 |
| llm-jp-3.1-8x13b-instruct4 | 7.50 | 7.05 |

AnswerCarefully-Eval assesses the safety of Japanese language model outputs using the LLM-as-a-Judge approach, based on the test set from llm-jp/AnswerCarefully. We evaluated the models using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation. For more details, please refer to the codes.
| Model name | Score | Acceptance rate (%, ↑) | Violation rate (%, ↓) |
|:---|---:|---:|---:|
| gpt-35-turbo-1106 | 3.98 | 71.7 | 12.6 |
| gpt-4-0613 | 4.06 | 72.3 | 13.2 |
| gpt-4o-2024-08-06 | 4.09 | 72.7 | 12.5 |
| llm-jp-3-1.8b-instruct3 | 4.03 | 75.9 | 12.2 |
| llm-jp-3-13b-instruct3 | 4.37 | 88.4 | 6.5 |
| llm-jp-3-8x13b-instruct3 | 4.48 | 91.6 | 4.3 |
| llm-jp-3.1-1.8b-instruct4 | 3.66 | 64.7 | 24.3 |
| llm-jp-3.1-13b-instruct4 | 4.17 | 82.4 | 12.2 |
| llm-jp-3.1-8x13b-instruct4 | 4.26 | 83.1 | 11.6 |

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
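As a closing sanity check on this card (not part of the original), the per-dataset token counts in the pre-training blend table sum to roughly 2.07T, matching the "2.1T tokens" quoted for LLM-jp-3, and adding the 400B-token continuous pre-training stage gives the quoted 2.5T.

```python
# Cross-checking the pre-training data blend (figures in billions of
# tokens, transcribed from the table in this card) against the stated
# 2.1T / 2.5T totals.

blend_b = {
    "Japanese": [2.6, 762.8, 237.3, 2.7, 1.8],
    "English":  [4.7, 608.5, 181.6, 83.1, 62.9, 5.5, 3.9],
    "Code":     [114.1],
    "Chinese":  [0.8],
    "Korean":   [0.3],
}
total_b = sum(sum(v) for v in blend_b.values())
assert round(total_b, 1) == 2072.6               # ≈ 2.1T tokens
assert round((total_b + 400) / 1000, 1) == 2.5   # ≈ 2.5T after mid-training
```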


llm-jp-3-1.8b-instruct


llm-jp-3.1-13b-instruct4

The model card for this repository is identical to that of llm-jp-3.1-1.8b-instruct4 above, except that this repository provides the llm-jp-3.1-13b-instruct4 model.


llm-jp-4-8b-thinking


llm-jp-3-13b

This repository provides large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics.

| Model Variants |
|:---|
| llm-jp-3-1.8b |
| llm-jp-3-1.8b-instruct |
| llm-jp-3-3.7b |
| llm-jp-3-3.7b-instruct |
| llm-jp-3-13b |
| llm-jp-3-13b-instruct |
| llm-jp-3-172b-beta1 |
| llm-jp-3-172b-beta1-instruct |

Required libraries:

- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8

Model type: Transformer-based Language Model. Total seen tokens: 2.1T.

|Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|1.8b|24|2048|16|4096|407,896,064|1,459,718,144|
|3.7b|28|3072|24|4096|611,844,096|3,171,068,928|
|13b|40|5120|40|4096|1,019,740,160|12,688,184,320|

The tokenizer of this model is based on the huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from `llm-jp-tokenizer v3.0`. Please refer to the README.md of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce this vocabulary). The models have been pre-trained using a blend of the following datasets.

| Language | Dataset | Tokens |
|:---|:---|---:|
| Japanese | Wikipedia | 2.6B |
| | Common Crawl | 762.8B |
| | WARP/PDF | 237.3B |
| | WARP/HTML | 2.7B |
| | Kaken | 1.8B |
| English | Wikipedia | 4.7B |
| | Dolma/CC-head | 608.5B |
| | Dolma/C4 | 181.6B |
| | Dolma/Reddit | 83.1B |
| | Dolma/PeS2o | 62.9B |
| | Dolma/Gutenberg | 5.5B |
| | Dolma/Wiki | 3.9B |
| Code | The Stack | 114.1B |
| Chinese | Wikipedia | 0.8B |
| Korean | Wikipedia | 0.3B |

The models have been fine-tuned on the following datasets.
| Language | Dataset | Description |
|:---|:---|:---|
| Japanese | ichikara-instruction-004-002 | A manually constructed instruction dataset. |
| | answer-carefully-002 | A manually constructed instruction dataset focusing on LLMs' safety. |
| | ichikara-instruction-format | A small subset of ichikara-instruction, edited with some constraints on the output format. |
| | AutoMultiTurnByCalm3-22B | A synthetic instruction dataset. |
| | ramdom-to-fixed-multiturn-Calm3 | A synthetic instruction dataset. |
| | wizardlm8x22b-logical-math-coding-sftadditional-ja | A synthetic instruction dataset. |
| | Synthetic-JP-EN-Coding-Dataset-567k | A synthetic instruction dataset; we used a sampled subset. |
| English | FLAN | We used a sampled subset. |

We evaluated the models using 100 examples from the dev split.

| Model name | average | EL | FA | HE | MC | MR | MT | NLI | QA | RC |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| llm-jp-3-1.8b | 0.3767 | 0.3725 | 0.1948 | 0.2350 | 0.2500 | 0.0900 | 0.7730 | 0.3080 | 0.4629 | 0.7040 |
| llm-jp-3-1.8b-instruct | 0.4596 | 0.4280 | 0.1987 | 0.3250 | 0.3300 | 0.4200 | 0.7900 | 0.3520 | 0.4698 | 0.8224 |
| llm-jp-3-3.7b | 0.4231 | 0.3812 | 0.2440 | 0.2200 | 0.1900 | 0.3600 | 0.7947 | 0.3800 | 0.4688 | 0.7694 |
| llm-jp-3-3.7b-instruct | 0.5188 | 0.4191 | 0.2504 | 0.3400 | 0.5000 | 0.5800 | 0.8166 | 0.4500 | 0.4881 | 0.8247 |
| llm-jp-3-13b | 0.5802 | 0.5570 | 0.2593 | 0.4600 | 0.7000 | 0.6300 | 0.8292 | 0.3460 | 0.5937 | 0.8469 |
| llm-jp-3-13b-instruct | 0.6168 | 0.5408 | 0.2757 | 0.4950 | 0.9200 | 0.7100 | 0.8317 | 0.4640 | 0.4642 | 0.8500 |

We evaluated the models using `gpt-4-0613`. Please see the codes for details.
| Model name | average | coding | extraction | humanities | math | reasoning | roleplay | stem | writing |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| llm-jp-3-1.8b-instruct | 4.93 | 1.50 | 4.70 | 7.80 | 1.55 | 2.60 | 7.80 | 6.10 | 7.40 |
| llm-jp-3-3.7b-instruct | 5.50 | 1.95 | 4.05 | 8.25 | 2.25 | 4.00 | 8.80 | 7.25 | 7.45 |
| llm-jp-3-13b-instruct | 6.47 | 3.15 | 7.05 | 9.15 | 3.75 | 5.40 | 8.30 | 7.50 | 7.45 |

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
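As a quick consistency check (not part of the original card), the "average" columns in the two evaluation tables above appear to be plain unweighted means of the per-category scores, and transcribed rows can be re-verified as such.

```python
# Re-checking the "average" columns as macro-averages of category scores.

def macro_average(scores, ndigits=4):
    """Unweighted mean of per-category scores, rounded for comparison."""
    return round(sum(scores) / len(scores), ndigits)

# llm-jp-3-1.8b row of the dev-split table (EL, FA, HE, MC, MR, MT, NLI, QA, RC)
categories = [0.3725, 0.1948, 0.2350, 0.2500, 0.0900, 0.7730, 0.3080, 0.4629, 0.7040]
assert macro_average(categories) == 0.3767

# llm-jp-3-1.8b-instruct row of the gpt-4-0613-judged table (8 categories)
judged = [1.50, 4.70, 7.80, 1.55, 2.60, 7.80, 6.10, 7.40]
assert macro_average(judged, ndigits=2) == 4.93
```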


llm-jp-roberta-random-init


llm-jp-3-8x13b


llm-jp-3-150m


llm-jp-3-3.7b

The model card for this repository is identical to the shared llm-jp-3 card shown above under llm-jp-3-13b, which covers all llm-jp-3 variants, including llm-jp-3-3.7b.


llm-jp-3-13b-instruct


llm-jp-3-7.2b-instruct2


llm-jp-1.3b-v1.0


llm-jp-13b-v1.0

This repository provides large language models developed by LLM-jp, a collaborative project launched in Japan.

| Model Variants |
|:---|
| Instruction models |
| llm-jp-13b-instruct-full-jaster-v1.0 |
| llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 |
| llm-jp-13b-instruct-full-dolly-oasst-v1.0 |
| llm-jp-13b-instruct-lora-jaster-v1.0 |
| llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 |
| llm-jp-13b-instruct-lora-dolly-oasst-v1.0 |
| Pre-trained models |
| llm-jp-13b-v1.0 |
| llm-jp-1.3b-v1.0 |

Checkpoints format: Hugging Face Transformers (Megatron-DeepSpeed format models are available here).

Required libraries:

- torch>=2.0.0
- transformers>=4.34.0
- tokenizers>=0.14.0
- accelerate==0.23.0

Model type: Transformer-based Language Model. Total seen tokens: 300B.

|Model|Params|Layers|Hidden size|Heads|Context length|
|:---:|:---:|:---:|:---:|:---:|:---:|
|13b model|13b|40|5120|40|2048|
|1.3b model|1.3b|24|2048|16|2048|

Training setup:

- Pre-training:
  - Hardware: 96 A100 40GB GPUs (mdx cluster)
  - Software: Megatron-DeepSpeed
- Instruction tuning:
  - Hardware: 8 A100 40GB GPUs (mdx cluster)
  - Software: TRL, PEFT, and DeepSpeed

Tokenizer: the tokenizer of this model is based on the huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from `llm-jp-tokenizer v2.1 (50k)`. Please refer to the README.md of `llm-jp-tokenizer` for details on the vocabulary construction procedure.

- Model: Hugging Face fast tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
- Training algorithm: SentencePiece Unigram with byte fallback
- Training data: a subset of the datasets used for model pre-training
- Vocabulary size: 50,570 (mixed vocabulary of Japanese, English, and source code)

The models have been pre-trained using a blend of the following datasets.
| Language | Dataset | Tokens |
|:---|:---|---:|
| Japanese | Wikipedia | 1.5B |
| | mC4 | 136B |
| English | Wikipedia | 5B |
| | The Pile | 135B |
| Code | The Stack | 10B |

Pre-training was conducted over 10 non-overlapping folds of data, each consisting of approximately 27–28B tokens. We finalized pre-training with an additional (potentially) higher-quality 27B tokens drawn from the same source datasets listed above. The models have been fine-tuned on the following datasets.

| Language | Dataset | Description |
|:---|:---|:---|
| Japanese | jaster | Data automatically transformed from existing Japanese NLP datasets |
| | databricks-dolly-15k | Translated by DeepL within LLM-jp |
| | OpenAssistant Conversations Dataset | Translated by DeepL within LLM-jp |

Evaluation: you can view the evaluation results of several LLMs on the leaderboard; we used llm-jp-eval for the evaluation.

The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Model card authors (in alphabetical order): Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takumi Okamoto.
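The staged pre-training schedule described in this card (10 folds plus a final ~27B-token stage, totalling 300B seen tokens) implies a per-fold size consistent with the quoted 27–28B range; a minimal arithmetic check, added here for illustration:

```python
# Checking the fold arithmetic of the 300B-token pre-training schedule.

TOTAL_TOKENS_B = 300   # total seen tokens, in billions
FINAL_STAGE_B = 27     # final higher-quality stage
NUM_FOLDS = 10         # non-overlapping folds

per_fold_b = (TOTAL_TOKENS_B - FINAL_STAGE_B) / NUM_FOLDS
assert per_fold_b == 27.3          # within the stated 27-28B per fold
assert 27 <= per_fold_b <= 28
```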


llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0


llm-jp-13b-instruct-full-jaster-v1.0


llm-jp-13b-instruct-full-dolly-oasst-v1.0


llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1


llm-jp-modernbert-base


llm-jp-3.1-13b


llm-jp-4-32b-a3b-thinking


llm-jp-3-3.7b-instruct


llm-jp-3.1-1.8b


llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0


llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0


llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0


llm-jp-13b-v2.0


llm-jp-3-980m


llm-jp-3-7.2b-instruct3


Llama-Mimi-1.3B

Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens (🤗 HuggingFace | 📄 Paper | 🗣️ Online Demo | 🧑‍💻 Code)

Llama-Mimi is a speech language model that uses a unified tokenizer (Mimi) and a single Transformer decoder (Llama) to jointly model sequences of interleaved semantic and acoustic tokens. Trained on ~240k hours of English audio, Llama-Mimi achieves state-of-the-art acoustic consistency on SALMon and effectively preserves speaker identity. Visit our demo site to hear generated speech samples.

| Models | 🤗 Hugging Face |
|:---|:---|
| Llama-Mimi-1.3B | llm-jp/Llama-Mimi-1.3B |
| Llama-Mimi-8B | llm-jp/Llama-Mimi-8B |

To generate audio continuations from a given audio prompt, check out our repository: https://github.com/llm-jp/llama-mimi
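The card's central idea, modeling one flat sequence of interleaved semantic and acoustic tokens, can be illustrated with a small sketch. This is an assumption-laden illustration, not the official implementation: it assumes each Mimi frame contributes one semantic token followed by that frame's acoustic codebook tokens.

```python
# Illustrative interleaving of per-frame Mimi tokens into the single flat
# sequence a Llama decoder would model (layout assumed, not official).

def interleave_frames(semantic, acoustic):
    """semantic: [T] tokens; acoustic: [T][Q] codebook tokens -> flat list."""
    seq = []
    for s, frame in zip(semantic, acoustic):
        seq.append(s)      # semantic token for this frame first
        seq.extend(frame)  # then the frame's acoustic codebook tokens
    return seq

# Two frames, one semantic + two acoustic codebooks each:
assert interleave_frames([1, 2], [[10, 11], [20, 21]]) == [1, 10, 11, 2, 20, 21]
```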


llm-jp-4-vl-9b-beta


llm-jp-3-150m-instruct3


llm-jp-3-7.2b-instruct


llm-jp-3-vila-14b


llm-jp-3-1.8b-instruct3


llm-jp-3-440m


llm-jp-3-13b-instruct3


llm-jp-3.1-8x13b-instruct4

LLM-jp-3.1 is a series of large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics. Building upon the LLM-jp-3 series, the LLM-jp-3.1 models incorporate mid-training (instruction pre-training), which significantly enhances their instruction-following capabilities compared to the original LLM-jp-3 models. This repository provides the llm-jp-3.1-8x13b-instruct4 model. For an overview of the LLM-jp-3.1 models across different parameter sizes, please refer to: - LLM-jp-3.1 Pre-trained Models - LLM-jp-3.1 Fine-tuned Models. For more details on the training procedures and evaluation results, please refer to this blog post (in Japanese). - torch>=2.3.0 - transformers>=4.40.1 - tokenizers>=0.19.1 - accelerate>=0.29.3 - flash-attn>=2.5.8 - Model type: Transformer-based Language Model - Architectures: Dense model: |Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters| |:---:|:---:|:---:|:---:|:---:|:---:|:---:| |1.8b|24|2048|16|4096|407,498,752|1,459,718,144| |13b|40|5120|40|4096|1,018,746,880|12,688,184,320| MoE model: |Params|Layers|Hidden size|Heads|Routed Experts|Activated Experts|Context length|Embedding parameters|Non-embedding parameters|Activated parameters|Total parameters| |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |8x13b|40|5120|40|8|2|4096|1,018,746,880|72,144,081,920|22,200,806,400|73,162,828,800| The tokenizer of this model is based on huggingface/tokenizers Unigram byte-fallback model. The vocabulary entries were converted from `llm-jp-tokenizer v3.0`. Please refer to README.md of `llm-jp-tokenizer` for details on the vocabulary construction procedure (the pure SentencePiece training does not reproduce our vocabulary). The models have been pre-trained using a blend of the following datasets. 
| Language | Dataset | Tokens| |:---|:---|---:| |Japanese|Wikipedia|2.6B ||Common Crawl|762.8B ||WARP/PDF|237.3B ||WARP/HTML|2.7B ||Kaken|1.8B |English|Wikipedia|4.7B ||Dolma/CC-head|608.5B ||Dolma/C4|181.6B ||Dolma/Reddit|83.1B ||Dolma/PeS2o|62.9B ||Dolma/Gutenberg|5.5B ||Dolma/Wiki|3.9B |Code|The Stack|114.1B |Chinese|Wikipedia|0.8B |Korean|Wikipedia|0.3B In the LLM-jp-3.1 series, we performed continuous pre-training based on Instruction Pre-Training. Instruction Pre-Training enhances a model’s ability to follow instructions by continuing pre-training on a large collection of instruction–response pairs. We prepared approximately 90B tokens of instruction–response data and mixed it with our pre-training datasets, conducting continuous pre-training on a total of 400B tokens. Each model was initialized from existing checkpoints (llm-jp/llm-jp-3-1.8b, llm-jp/llm-jp-3-13b, and llm-jp/llm-jp-3-8x13b) and underwent continuous instruction pre-training. Since the LLM-jp-3 series was originally pre-trained on 2.1T tokens, the total pre-training token count amounts to 2.5T tokens. Details of this training process will be released in a forthcoming paper. The instruction–response dataset used for this training will also be made publicly available. We have fine-tuned the pre-trained checkpoint with supervised fine-tuning and further aligned it with Direct Preference Optimization. Supervised Fine-tuning The datasets used for supervised fine-tuning are as follows: | Language | Dataset | Description | |:---|:---|:---| |Japanese|ichikara-instruction-004-002| A manually constructed instruction dataset. | | |AnswerCarefully (ver2.0)| A manually constructed instruction dataset focusing on LLMs' safety. | | |ichikara-instruction-format| A small subset of the ichikara-instruction dataset, edited with some constraints on the output format. | | |AutoMultiTurnByCalm3-22B| A synthetic instruction dataset. | | |ramdom-to-fixed-multiturn-Calm3| A synthetic instruction dataset. 
| | |wizardlm8x22b-logical-math-coding-sft-ja| A synthetic instruction dataset. | | |magpie-sft-v1.0| A synthetic instruction dataset we created. | | |jaster v1.4.1| - | | |extraction-wiki-ja| A synthetic instruction dataset we created. | |English|Daring-Anteater| - | |Japanese & English|Synthetic-JP-EN-Coding-Dataset| A synthetic instruction dataset. | For Direct Preference Optimization (DPO), we adopted rejection sampling. Prompts were sampled from the dataset used in SFT, and multiple responses were generated for each prompt. These responses were then scored (by Qwen/Qwen2.5-32B-Instruct), and DPO was performed by treating high-scoring responses as positive examples and low-scoring responses as negative examples. We conducted DPO in two stages. In the second stage, we additionally used ac-self-inst, a Japanese preference dataset focused on safety. We evaluated the models using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation. For more details, please refer to the codes. 
| Model Name | JA | EN |
|:---|---:|---:|
| gpt-35-turbo-1106 | 6.48 | 7.56 |
| gpt-4-0613 | 7.29 | 7.72 |
| gpt-4o-2024-08-06 | 8.10 | 8.38 |
| sbintuitions/sarashina2.2-1b-instruct-v0.1 | 5.30 | 5.66 |
| sbintuitions/sarashina2.2-3b-instruct-v0.1 | 7.07 | 6.96 |
| Rakuten/RakutenAI-2.0-8x7B-instruct | 6.68 | 6.33 |
| cyberagent/calm3-22b-chat | 6.86 | 6.77 |
| Qwen/Qwen2.5-14B-Instruct | 7.07 | 7.99 |
| Qwen/Qwen2.5-32B-Instruct | 7.64 | 8.27 |
| Qwen/Qwen3-1.7B | 5.46 | 6.95 |
| Qwen/Qwen3-14B | 8.00 | 8.30 |
| Qwen/Qwen3-32B | 8.36 | 8.33 |
| tokyotech-llm/Llama-3.3-Swallow-70B-Instruct-v0.4 | 7.64 | 8.02 |
| stockmark/Stockmark-2-100B-Instruct-beta | 7.42 | 7.17 |
| llm-jp-3-1.8b-instruct3 | 4.64 | 4.09 |
| llm-jp-3-13b-instruct3 | 6.21 | 6.13 |
| llm-jp-3-8x13b-instruct3 | 6.60 | 6.49 |
| llm-jp-3.1-1.8b-instruct4 | 6.30 | 5.70 |
| llm-jp-3.1-13b-instruct4 | 7.37 | 7.01 |
| llm-jp-3.1-8x13b-instruct4 | 7.50 | 7.05 |

AnswerCarefully-Eval assesses the safety of Japanese language model outputs using the LLM-as-a-Judge approach, based on the test set from llm-jp/AnswerCarefully. We evaluated the models using `gpt-4o-2024-08-06`. The scores represent the average values obtained from three rounds of inference and evaluation. For more details, please refer to the code.
| Model name | Score | Acceptance rate (%, ↑) | Violation rate (%, ↓) |
|:---|---:|---:|---:|
| gpt-35-turbo-1106 | 3.98 | 71.7 | 12.6 |
| gpt-4-0613 | 4.06 | 72.3 | 13.2 |
| gpt-4o-2024-08-06 | 4.09 | 72.7 | 12.5 |
| llm-jp-3-1.8b-instruct3 | 4.03 | 75.9 | 12.2 |
| llm-jp-3-13b-instruct3 | 4.37 | 88.4 | 6.5 |
| llm-jp-3-8x13b-instruct3 | 4.48 | 91.6 | 4.3 |
| llm-jp-3.1-1.8b-instruct4 | 3.66 | 64.7 | 24.3 |
| llm-jp-3.1-13b-instruct4 | 4.17 | 82.4 | 12.2 |
| llm-jp-3.1-8x13b-instruct4 | 4.26 | 83.1 | 11.6 |

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
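The evaluations above report, per model, the mean of three independent inference-and-evaluation rounds, plus acceptance and violation rates for the safety benchmark. A minimal sketch of this aggregation follows; the acceptance (average score of at least 4) and violation (at most 2) thresholds are our assumption for illustration and are not stated in the card.

```python
from statistics import mean

# Sketch: average per-example judge scores over three rounds, then compute
# the overall score and acceptance/violation rates. Thresholds are assumed.

def aggregate(rounds, accept_at=4.0, violate_at=2.0):
    """rounds: one list of per-example judge scores per evaluation round."""
    per_example = [mean(scores) for scores in zip(*rounds)]
    n = len(per_example)
    return {
        "score": round(mean(per_example), 2),
        "acceptance_rate": round(100 * sum(s >= accept_at for s in per_example) / n, 1),
        "violation_rate": round(100 * sum(s <= violate_at for s in per_example) / n, 1),
    }

# Three rounds of judge scores for three examples:
print(aggregate([[5, 2, 4], [5, 1, 4], [5, 3, 4]]))
# {'score': 3.67, 'acceptance_rate': 66.7, 'violation_rate': 33.3}
```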

Other llm-jp models (tag, downloads · likes):

- llm-jp-3.1-1.8b-instruct4 (apache-2.0): 106 downloads · 4 likes
- llm-jp-roberta-base: 102 downloads · 0 likes
- llm-jp-3-8x13b-instruct3 (apache-2.0): 81 downloads · 8 likes
- llm-jp-3-3.7b-instruct3 (llama): 81 downloads · 2 likes
- llm-jp-clip-vit-base-patch16 (apache-2.0): 79 downloads · 1 like
- llm-jp-clip-vit-large-patch14 (apache-2.0): 67 downloads · 2 likes
- llm-jp-3-980m-instruct3 (llama): 50 downloads · 3 likes
- llm-jp-3-440m-instruct3 (llama): 47 downloads · 1 like
- BTX-8x152M (apache-2.0): 33 downloads · 0 likes
- llm-jp-3-8x1.8b-instruct3 (apache-2.0): 29 downloads · 3 likes
- llm-jp-3-13b-instruct2 (llama): 20 downloads · 1 like
- llm-jp-4-8b-instruct (llama): 19 downloads · 1 like
- Llama-Mimi-8B (llama): 16 downloads · 8 likes. Llama-Mimi is a speech language model that uses a unified tokenizer (Mimi) and a single Transformer decoder (Llama) to jointly model sequences of interleaved semantic and acoustic tokens. Trained on ~240k hours of English audio, it achieves state-of-the-art acoustic consistency on SALMon and effectively preserves speaker identity. Sibling checkpoint: llm-jp/Llama-Mimi-1.3B. Code: https://github.com/llm-jp/llama-mimi
- llm-jp-3-172b-alpha2 (llama): 15 downloads · 0 likes
- waon-siglip2-base-patch16-256 (apache-2.0): 14 downloads · 1 like
- llm-jp-3-440m-instruct2 (llama): 13 downloads · 0 likes
- llm-jp-moshi-v1 (apache-2.0): 12 downloads · 10 likes
- llm-jp-3-172b-instruct3 (llama): 12 downloads · 10 likes
- llm-jp-3-7.2b (llama): 11 downloads · 1 like
- llm-jp-3-980m-instruct2 (llama): 9 downloads · 3 likes
- llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 (apache-2.0): 9 downloads · 1 like
- llm-jp-3-8x13b-instruct2 (apache-2.0): 9 downloads · 0 likes
- optimal-sparsity-math-d2048-E8-k8-3.9B-A3.9B (apache-2.0): 8 downloads · 0 likes
- optimal-sparsity-math-d2048-E32-k4-13.6B-A2.3B (apache-2.0): 7 downloads · 0 likes
- optimal-sparsity-math-d2048-E128-k16-52.2B-A7.1B (apache-2.0): 7 downloads · 0 likes
- optimal-sparsity-math-d512-E256-k2-6.6B-A170M (apache-2.0): 7 downloads · 0 likes
- optimal-sparsity-math-d1024-E32-k4-3.5B-A670M (apache-2.0): 7 downloads · 0 likes
- optimal-sparsity-math-d512-E64-k8-1.7B-A320M (apache-2.0): 7 downloads · 0 likes
- llm-jp-3-172b-alpha1-instruct (llama): 7 downloads · 0 likes
- optimal-sparsity-math-d1024-E16-k2-1.9B-A470M (apache-2.0): 6 downloads · 0 likes. The optimal-sparsity-* repositories contain model checkpoints from the paper "Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks"; for code and evaluation procedures, see https://github.com/rioyokotalab/optimal-sparsity
- optimal-sparsity-math-d512-E8-k2-320M-A170M (apache-2.0): 6 downloads · 0 likes

- optimal-sparsity-math-d1024-E8-k2-1.1B-A470M (apache-2.0): 6 downloads · 0 likes
- optimal-sparsity-math-d2048-E16-k8-7.1B-A3.9B (apache-2.0): 6 downloads · 0 likes
- optimal-sparsity-math-d512-E64-k2-1.7B-A170M (apache-2.0): 6 downloads · 0 likes
- optimal-sparsity-math-d512-E256-k4-6.6B-A220M (apache-2.0): 6 downloads · 0 likes
- optimal-sparsity-math-d1024-E64-k4-6.7B-A670M (apache-2.0): 6 downloads · 0 likes
- llm-jp-3-150m-instruct2 (llama): 6 downloads · 0 likes
- llm-jp-13b-instruct-lora-jaster-v1.0 (apache-2.0): 5 downloads · 2 likes
- llm-jp-13b-instruct-lora-dolly-oasst-v1.0 (apache-2.0): 5 downloads · 1 like
- optimal-sparsity-code-d512-E64-k2-1.7B-A170M (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d512-E128-k2-3.3B-A170M (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d512-E128-k4-3.3B-A220M (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d512-E128-k16-3.3B-A520M (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d1024-E32-k16-3.5B-A1.9B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d512-E32-k16-920M-A520M (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d1024-E16-k8-1.9B-A1.1B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d1024-E128-k8-13.2B-A1.1B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d2048-E16-k2-7.1B-A1.5B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d2048-E32-k16-13.6B-A7.1B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d1024-E64-k8-6.7B-A1.1B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-math-d2048-E64-k8-26.4B-A3.9B (apache-2.0): 5 downloads · 0 likes
- optimal-sparsity-code-d512-E16-k16-520M-A520M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-code-d1024-E128-k4-13.2B-A670M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-code-d512-E128-k16-3.3B-A520M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-code-d512-E256-k4-6.6B-A220M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E32-k2-3.5B-A470M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d2048-E128-k4-52.2B-A2.3B (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E16-k16-520M-A520M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E128-k16-13.2B-A1.9B (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E256-k16-26.0B-A1.9B (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E32-k4-920M-A220M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E8-k4-320M-A220M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E16-k8-520M-A320M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E32-k8-920M-A320M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d2048-E128-k8-52.2B-A3.9B (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d2048-E16-k4-7.1B-A2.3B (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E64-k4-1.7B-A220M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E8-k4-1.1B-A670M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E128-k4-13.2B-A670M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d1024-E256-k4-26.0B-A670M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-math-d512-E256-k16-6.6B-A520M (apache-2.0): 4 downloads · 0 likes
- optimal-sparsity-code-d1024-E8-k4-1.1B-A670M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-code-d512-E8-k2-320M-A170M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-code-d512-E128-k2-3.3B-A170M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-code-d2048-E32-k8-13.6B-A3.9B (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d512-E16-k2-520M-A170M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d1024-E128-k2-13.2B-A470M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d2048-E32-k2-13.6B-A1.5B (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d512-E128-k8-3.3B-A320M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d1024-E32-k8-3.5B-A1.1B (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d1024-E64-k2-6.7B-A470M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d512-E64-k16-1.7B-A520M (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d1024-E64-k16-6.7B-A1.9B (apache-2.0): 3 downloads · 0 likes
- optimal-sparsity-math-d2048-E64-k16-26.4B-A7.1B (apache-2.0): 3 downloads · 0 likes
- FS-8x152M (apache-2.0): 3 downloads · 0 likes
- llm-jp-3-1.8b-instruct2 (llama): 3 downloads · 0 likes
- llm-jp-3-1.8b-sae-l12-k32-16x-c988240 (apache-2.0): 3 downloads · 0 likes

optimal-sparsity-code-d1024-E64-k8-6.7B-A1.1B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
2
0

optimal-sparsity-code-d512-E32-k8-920M-A320M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
2
0

optimal-sparsity-code-d1024-E64-k2-6.7B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
2
0

optimal-sparsity-code-d1024-E128-k2-13.2B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
2
0

optimal-sparsity-code-d2048-E64-k2-26.4B-A1.5B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
2
0

optimal-sparsity-code-d2048-E8-k4-3.9B-A2.3B

NaNK
license:apache-2.0
2
0

optimal-sparsity-code-d1024-E256-k16-26.0B-A1.9B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d1024-E256-k2-26.0B-A470M

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d1024-E16-k16-1.9B-A1.9B

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d512-E32-k2-920M-A170M

license:apache-2.0
2
0

optimal-sparsity-math-d2048-E8-k4-3.9B-A2.3B

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d512-E8-k8-320M-A320M

license:apache-2.0
2
0

optimal-sparsity-math-d2048-E64-k2-26.4B-A1.5B

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d1024-E16-k4-1.9B-A670M

NaNK
license:apache-2.0
2
0

optimal-sparsity-math-d512-E256-k8-6.6B-A320M

NaNK
license:apache-2.0
2
0

FS-8x3.7B

NaNK
license:apache-2.0
2
0

DU-0.5-8x3.7B

NaNK
license:apache-2.0
2
0

llm-jp-3-8x1.8b-instruct2

NaNK
license:apache-2.0
2
0

llm-jp-3.1-8x13b

NaNK
license:apache-2.0
2
0

llm-jp-3-172b-alpha1

NaNK
llama
1
1

Dense-btx-code-expert-1.5B

NaNK
llama
1
1

optimal-sparsity-code-d1024-E8-k8-1.1B-A1.1B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E128-k8-13.2B-A1.1B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E16-k16-1.9B-A1.9B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E64-k16-6.7B-A1.9B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E128-k16-13.2B-A1.9B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E64-k16-1.7B-A520M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E16-k4-1.9B-A670M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E64-k4-6.7B-A670M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E32-k4-3.5B-A670M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E8-k8-320M-A320M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E16-k8-520M-A320M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E64-k8-1.7B-A320M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E128-k8-3.3B-A320M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E32-k16-920M-A520M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d1024-E32-k16-3.5B-A1.9B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E16-k2-520M-A170M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E32-k2-920M-A170M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E256-k2-6.6B-A170M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E8-k2-1.1B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E16-k2-1.9B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E32-k2-3.5B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d1024-E256-k2-26.0B-A470M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d2048-E8-k2-3.9B-A1.5B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d2048-E16-k2-7.1B-A1.5B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d2048-E32-k2-13.6B-A1.5B

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d2048-E128-k2-52.2B-A1.5B

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E8-k4-320M-A220M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E16-k4-520M-A220M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E32-k4-920M-A220M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

license:apache-2.0
1
0

optimal-sparsity-code-d512-E64-k4-1.7B-A220M

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks This repository contains model checkpoints from the paper Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks. For more details, including code and evaluation procedures, please refer to the official GitHub repository: https://github.com/rioyokotalab/optimal-sparsity If you find our work helpful, please feel free to cite the paper.

NaNK
license:apache-2.0
1
0

optimal-sparsity-code-d512-E128-k4-3.3B-A220M

license:apache-2.0
1
0

optimal-sparsity-code-d1024-E256-k4-26.0B-A670M

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E16-k4-7.1B-A2.3B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E32-k4-13.6B-A2.3B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E64-k4-26.4B-A2.3B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E128-k4-52.2B-A2.3B

license:apache-2.0
1
0

optimal-sparsity-code-d512-E256-k8-6.6B-A320M

license:apache-2.0
1
0

optimal-sparsity-code-d1024-E256-k8-26.0B-A1.1B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E8-k8-3.9B-A3.9B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E16-k8-7.1B-A3.9B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E64-k8-26.4B-A3.9B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E128-k8-52.2B-A3.9B

license:apache-2.0
1
0

optimal-sparsity-code-d512-E256-k16-6.6B-A520M

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E16-k16-7.1B-A7.1B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E32-k16-13.6B-A7.1B

license:apache-2.0
1
0

optimal-sparsity-code-d2048-E128-k16-52.2B-A7.1B

license:apache-2.0
1
0

optimal-sparsity-math-d2048-E128-k2-52.2B-A1.5B

license:apache-2.0
1
0

optimal-sparsity-math-d512-E16-k4-520M-A220M

license:apache-2.0
1
0

optimal-sparsity-math-d2048-E8-k2-3.9B-A1.5B

license:apache-2.0
1
0

optimal-sparsity-math-d1024-E8-k8-1.1B-A1.1B

license:apache-2.0
1
0

optimal-sparsity-math-d1024-E256-k8-26.0B-A1.1B

license:apache-2.0
1
0

optimal-sparsity-math-d2048-E32-k8-13.6B-A3.9B

license:apache-2.0
1
0

optimal-sparsity-math-d2048-E16-k16-7.1B-A7.1B

license:apache-2.0
1
0

optimal-sparsity-math-d2048-E64-k4-26.4B-A2.3B

license:apache-2.0
1
0

FS-8x1.5B

license:apache-2.0
1
0

Dense-13B

llama
1
0

Dense-btx-english-expert-1.5B

llama
1
0

Dense-btx-japanese-expert-152M

llama
1
0

llm-jp-3-3.7b-instruct2

llama
1
0

llm-jp-3-172b-beta1

llama
0
9

llm-jp-3-172b-beta1-instruct

llama
0
5

llm-jp-3-172b

llama
0
4

llm-jp-3-172b-beta2

llama
0
3

llm-jp-3-172b-beta2-instruct2

llama
0
3

llm-jp-13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1

license:apache-2.0
0
1

llm-jp-13b-dpo-lora-hh_rlhf_ja-v1.1

license:apache-2.0
0
1

llm-jp-3-172b-alpha2-instruct

llama
0
1

RNU-0.5-8x152M

license:apache-2.0
0
1

Dense-btx-japanese-expert-1.5B

llama
0
1

Dense-btx-code-expert-152M

llama
0
1