tohoku-nlp

23 models • 3 total models in database

Sort by:

bert-base-japanese-whole-word-masking

--- language: ja license: cc-by-sa-4.0 datasets: - wikipedia widget: - text: 東北大学で[MASK]の研究をしています。 ---

bert-base-japanese-char-v3

BERT base Japanese (character-level tokenization with whole word masking, CC-100 and jawiki-20230102) This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in unidic-lite package), followed by character-level tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The model is trained on the Japanese portion of CC-100 dataset and the Japanese version of Wikipedia. For Wikipedia, we generated a text corpus from the Wikipedia Cirrussearch dump file as of January 2, 2023. The corpus files generated from CC-100 and Wikipedia are 74.3GB and 4.9GB in size and consist of approximately 392M and 34M sentences, respectively. For the purpose of splitting texts into sentences, we used fugashi with mecab-ipadic-NEologd dictionary (v0.0.7). The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 7027. We used fugashi and unidic-lite packages for the tokenization. We trained the model first on the CC-100 corpus for 1M steps and then on the Wikipedia corpus for another 1M steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by TPU Research Cloud. The pretrained models are distributed under the Apache License 2.0. This model is trained with Cloud TPUs provided by TPU Research Cloud program.

license:apache-2.0

101,545

bert-base-japanese-char-v2

BERT base Japanese (character-level tokenization with whole word masking, jawiki-20200831) This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in unidic-lite package), followed by character-level tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, containing approximately 30M sentences. We used the MeCab morphological parser with mecab-ipadic-NEologd dictionary to split texts into sentences. The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144. We used `fugashi` and `unidic-lite` packages for the tokenization. The models are trained with the same configuration as the original BERT; 512 tokens per instance, 256 instances per batch, and 1M training steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by TensorFlow Research Cloud program. The training took about 5 days to finish. The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0. This model is trained with Cloud TPUs provided by TensorFlow Research Cloud program.

license:cc-by-sa-4.0

101,033

bert-base-japanese-char

This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The model is trained on Japanese Wikipedia as of September 1, 2019. To generate the training corpus, WikiExtractor is used to extract plain texts from a dump file of Wikipedia articles. The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences. The texts are first tokenized by MeCab morphological parser with the IPA dictionary and then split into characters. The vocabulary size is 4000. The model is trained with the same configuration as the original BERT; 512 tokens per instance, 256 instances per batch, and 1M training steps. The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0. For training models, we used Cloud TPUs provided by TensorFlow Research Cloud program.

license:cc-by-sa-4.0

93,660

bert-large-japanese-v2

BERT large Japanese (unidic-lite with whole word masking, CC-100 and jawiki-20230102) This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in unidic-lite package), followed by the WordPiece subword tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT large model; 24 layers, 1024 dimensions of hidden states, and 16 attention heads. The model is trained on the Japanese portion of CC-100 dataset and the Japanese version of Wikipedia. For Wikipedia, we generated a text corpus from the Wikipedia Cirrussearch dump file as of January 2, 2023. The corpus files generated from CC-100 and Wikipedia are 74.3GB and 4.9GB in size and consist of approximately 392M and 34M sentences, respectively. For the purpose of splitting texts into sentences, we used fugashi with mecab-ipadic-NEologd dictionary (v0.0.7). The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768. We used fugashi and unidic-lite packages for the tokenization. We trained the model first on the CC-100 corpus for 1M steps and then on the Wikipedia corpus for another 1M steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by TPU Research Cloud. The pretrained models are distributed under the Apache License 2.0. This model is trained with Cloud TPUs provided by TPU Research Cloud program.

license:apache-2.0

69,807

bert-base-japanese-v3

BERT base Japanese (unidic-lite with whole word masking, CC-100 and jawiki-20230102) This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in unidic-lite package), followed by the WordPiece subword tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The model is trained on the Japanese portion of CC-100 dataset and the Japanese version of Wikipedia. For Wikipedia, we generated a text corpus from the Wikipedia Cirrussearch dump file as of January 2, 2023. The corpus files generated from CC-100 and Wikipedia are 74.3GB and 4.9GB in size and consist of approximately 392M and 34M sentences, respectively. For the purpose of splitting texts into sentences, we used fugashi with mecab-ipadic-NEologd dictionary (v0.0.7). The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768. We used fugashi and unidic-lite packages for the tokenization. We trained the model first on the CC-100 corpus for 1M steps and then on the Wikipedia corpus for another 1M steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by TPU Research Cloud. The pretrained models are distributed under the Apache License 2.0. This model is trained with Cloud TPUs provided by TPU Research Cloud program.

license:apache-2.0

63,419

bert-base-japanese

This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The model is trained on Japanese Wikipedia as of September 1, 2019. To generate the training corpus, WikiExtractor is used to extract plain texts from a dump file of Wikipedia articles. The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences. The texts are first tokenized by MeCab morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32000. The model is trained with the same configuration as the original BERT; 512 tokens per instance, 256 instances per batch, and 1M training steps. The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0. For training models, we used Cloud TPUs provided by TensorFlow Research Cloud program.

license:cc-by-sa-4.0

33,733

bert-base-japanese-v2

BERT base Japanese (unidic-lite with whole word masking, jawiki-20200831) This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in unidic-lite package), followed by the WordPiece subword tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective. The codes for the pretraining are available at cl-tohoku/bert-japanese. The model architecture is the same as the original BERT base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads. The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, containing approximately 30M sentences. We used the MeCab morphological parser with mecab-ipadic-NEologd dictionary to split texts into sentences. The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768. We used `fugashi` and `unidic-lite` packages for the tokenization. The models are trained with the same configuration as the original BERT; 512 tokens per instance, 256 instances per batch, and 1M training steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by TensorFlow Research Cloud program. The training took about 5 days to finish. The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0. This model is trained with Cloud TPUs provided by TensorFlow Research Cloud program.

license:cc-by-sa-4.0

13,435

bybert-jp-v2-100m

Llama アーキテクチャをベースとし、Causal Attention Mask を取り除くことで、Encoder 型言語モデルとして利用しています。具体的には、以下のモジュールを採用しています。 - SwiGLU - Rotary Positional Embeddings (RoPE) - Grouped Query Attention (GQA) llm-jp-corpus-v3 の日本語コーパスのサブセット (ja\cc, ja\warp\html, ja\warp\pdf, ja\wiki, kaken) を使用しました。また、学習時には Whole Word Masking を実施しています。 Whole Word Masking 単語分割器には、vibrato を利用しました。辞書は bccwj-suw+unidic-cwj-311 を用いています。モデルの重みを初期化した Llama アーキテクチャベースの Encoder モデルを from scratch で学習させています。各モデルの学習設定は以下の通りです。 | | Params. | Tokens | Steps | Batch Size (tokens) | | --- | --- | --- | --- | --- | | tohoku-nlp/bybert-jp-100m | 107 M | 623 B | 198,000 | 3,145,728 | | tohoku-nlp/bybert-jp-200m | 205 M | 637 B | 270,000 | 2,359,296 | | tohoku-nlp/bybert-jp-400m | 397 M | 1.23 T | 308,000 | 3,981,312 | | tohoku-nlp/bybert-jp-v2-100m | 114 M | 2.76 T | 330,000 | 8,388,608 | 学習には、Masked Language Modeling (MLM) のみ実施し、Next Sentence Prediction (NSP) は実施していません。また，tohoku-nlp/bybert-jp-v2-100mでは - 学習データ量を2.85T tokensに増やす - unicodeのencodingに独自形式を採用 - マスク率を50%で学習．その後30%に減少 - QKVの線形変換にバイアス項を追加 - batch sizeのwarmupを導入 | | bybert-jp-100,200,400m | bybert-jp-next-100m | | ---- | ---- | ---- | | Max Learning Rate | 1.0E-3 | 1.0E-3 | | Min Learning Rate | 1.0E-6 | 1.0E-6 | | Learning Rate Warmup Steps | 2,000 | 2,000 | | Scheduler | cosine | cosine | | Optimizer | AdamW | AdamW | | Optimizer Config | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | | Weight Decay | 0.01 | 0.01 | | Gradient Clipping | 1.0 | 1.0 | | Sequence Length | 3,072 | 4,096 | | MLM Probability | 0.3 | 0.5 -> 0.3 | | Replace Masked-token Probability | 0.8 | 0.8 | | Replace Random-token Probability | 0.1 | 0.1 | 評価指標として、単語のマスクされた単語の予測正解率を用いた。実験設定の詳細は工藤 et al. (2025) を参照してください。評価結果は以下の通りです。 | | ichikara | wiki | |--------------------------------|----------|------| | tohoku-nlp/bybert-jp-100m | 58.0 | 26.3 | | tohoku-nlp/bybert-jp-200m | 60.5 | 33.0 | | tohoku-nlp/bybert-jp-400m | 67.4 | 38.5 | | tohoku-nlp/bybert-jp-v2-100m | 63.4 | 40.5 | その他， - モデルアーキテクチャ探索 - ハイパーパラメータ探索 - 内部機序等のパフォーマンス以外の側面からの分析についても工藤 et al. (2025) を参照してください。本モデルの作者は本モデルを作成するにあたって、その内容、機能等について細心の注意を払っておりますが、モデルの出力が正確であるかどうか、安全なものであるか等について保証をするものではなく、何らの責任を負うものではありません。本モデルの利用により、万一、利用者に何らかの不都合や損害が発生したとしても、モデルやデータセットの作者や作者の所属組織は何らの責任を負うものではありません。このモデルの学習にあたり様々な面でご協力いただきました Tohoku NLP Group の皆様に感謝いたします。作成者 - Keito Kudo - Go Kamoda - Daiki Shiono - Jun Suzuki A Japanese BERT model that adopts a byte-level tokenizer. We have confirmed operation with transformers version 4.56.1. Based on the Llama architecture, we use it as an Encoder-type language model by removing the Causal Attention Mask. Specifically, we adopt the following modules: - SwiGLU - Rotary Positional Embeddings (RoPE) - Grouped Query Attention (GQA) We used a subset of Japanese corpora (jacc, jawarphtml, jawarppdf, jawiki, kaken) from llm-jp-corpus-v3. Additionally, we implemented Whole Word Masking during training. For the Whole Word Masking word segmenter, we used vibrato. We used the bccwj-suw+unidic-cwj-311 dictionary. We trained the Llama architecture-based Encoder model with initialized weights from scratch. The training configuration for each model is as follows: | | Params. | Tokens | Steps | Batch Size (tokens) | | --- | --- | --- | --- | --- | | tohoku-nlp/bybert-jp-100m | 107 M | 623 B | 198,000 | 3,145,728 | | tohoku-nlp/bybert-jp-200m | 205 M | 637 B | 270,000 | 2,359,296 | | tohoku-nlp/bybert-jp-400m | 397 M | 1.23 T | 308,000 | 3,981,312 | | tohoku-nlp/bybert-jp-v2-100m | 114 M | 2.76 T | 330,000 | 8,388,608 | Training was performed using only Masked Language Modeling (MLM), without Next Sentence Prediction (NSP). Additionally, for tohoku-nlp/bybert-jp-v2-100m: - Increased training data volume to 2.85T tokens - Adopted proprietary format for unicode encoding - Trained with 50% mask rate, then reduced to 30% - Added bias term to QKV linear transformations - Introduced batch size warmup Through these improvements, we achieved relatively high performance despite being a small-scale model. | | bybert-jp-100,200,400m | bybert-jp-next-100m | | ---- | ---- | ---- | | Max Learning Rate | 1.0E-3 | 1.0E-3 | | Min Learning Rate | 1.0E-6 | 1.0E-6 | | Learning Rate Warmup Steps | 2,000 | 2,000 | | Scheduler | cosine | cosine | | Optimizer | AdamW | AdamW | | Optimizer Config | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | | Weight Decay | 0.01 | 0.01 | | Gradient Clipping | 1.0 | 1.0 | | Sequence Length | 3,072 | 4,096 | | MLM Probability | 0.3 | 0.5 -> 0.3 | | Replace Masked-token Probability | 0.8 | 0.8 | | Replace Random-token Probability | 0.1 | 0.1 | For training, we use a codebase based on Megatron-LM with our own modifications. We used the prediction accuracy of masked words as the evaluation metric. For details of the experimental setup, please refer to Kudo et al. (2025). The evaluation results are as follows: | | ichikara | wiki | |--------------------------------|----------|------| | tohoku-nlp/bybert-jp-100m | 58.0 | 26.3 | | tohoku-nlp/bybert-jp-200m | 60.5 | 33.0 | | tohoku-nlp/bybert-jp-400m | 67.4 | 38.5 | | tohoku-nlp/bybert-jp-v2-100m | 63.4 | 40.5 | For other aspects including: - Model architecture exploration - Hyperparameter exploration - Analysis from non-performance perspectives such as internal mechanisms This model is distributed under the Apache License 2.0. While the authors of this model have paid careful attention to its content and functionality in creating this model, they do not warrant that the model's output is accurate or safe, and assume no responsibility whatsoever. The authors of the model and dataset and their affiliated organizations assume no responsibility for any inconvenience or damage that may occur to users through the use of this model. We would like to thank everyone at Tohoku NLP Group for their cooperation in various aspects of training this model. Creators - Keito Kudo - Go Kamoda - Daiki Shiono - Jun Suzuki

llama_enc

tohokunlp-bert-500m-sq4096-alpha

llama_enc

tohokunlp-bert-500m-sq8192-alpha

llama_enc

bert-large-japanese-char

license:cc-by-sa-4.0

bybert-jp-200m

llama_enc

stable-diffusion-xl-jp-base-1.0

—

roberta-base-japanese

—

bygpt-jp-multi-lm-head-6.5B-alpha

バイト単位のtokenizerを採用した，日本語言語モデルです。一度に4tokens (bytes) ずつ予測するための，複数のlmヘッドを持つアーキテクチャを採用しています。また，multi byte predictionに適した独自のUnicode encodingを採用しています。現在開発段階のモデルであり，十分な性能には達していません。 transformers version 4.56.1 において、動作確認しています。他のバージョンでは動作しない可能性があります。利用上の注意点本モデルは，1度に4bytes (tokens) ずつ予測するため，特殊tokenも複数トークン (bytes) で構成されています．そのため，例えばtokenizer.eos\tokenはlist of intです．また，generate関数はcustom\generateの機能により実装されており，利用可能な機能に制限があります．また，このモデルはinstruction tuning等は実施していないモデルです． - SwiGLU - Rotary Positional Embeddings (RoPE) - Grouped Query Attention (GQA) また，4tokens (bytes) ずつ予測するため， - 4つのlmヘッド - 入力のembeddingを4tokenごとにマージするモジュールを追加しています。 llm-jp-corpus-v3 の日本語コーパスのサブセット (ja\cc, ja\warp\html, ja\warp\pdf, ja\wiki, kaken) を使用しました。 | | tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha | | ---- | ---- | | Training Steps | 208,000 | | Batch Size (tokens) | 5,898,240 | | Max Learning Rate | 5.0E-4 | | Min Learning Rate | 1.0E-5 | | Learning Rate Warmup Steps | 2,000 | | Scheduler | cosine | | Optimizer | AdamW | | Optimizer Config | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | | Weight Decay | 0.01 | | Gradient Clipping | 1.0 | | Sequence Length | 11,520 | 本モデルの作者は本モデルを作成するにあたって、その内容、機能等について細心の注意を払っておりますが、モデルの出力が正確であるかどうか、安全なものであるか等について保証をするものではなく、何らの責任を負うものではありません。本モデルの利用により、万一、利用者に何らかの不都合や損害が発生したとしても、モデルやデータセットの作者や作者の所属組織は何らの責任を負うものではありません。このモデルの学習にあたり様々な面でご協力いただきました Tohoku NLP Group の皆様に感謝いたします。作成者 - Keito Kudo - Go Kamoda - Daiki Shiono - Jun Suzuki This is a Japanese language model that adopts a byte-level tokenizer. It adopts an architecture with multiple LM heads for predicting 4 tokens (bytes) at once. It also adopts a unique Unicode encoding suitable for multi-byte prediction. This is currently a model in development stage and has not yet reached sufficient performance. Operation has been confirmed with transformers version 4.56.1. It may not work with other versions. Important Notes for Usage Since this model predicts 4 bytes (tokens) at once, special tokens are also composed of multiple tokens (bytes). Therefore, for example, tokenizer.eostoken is a list of int. Also, the generate function is implemented through customgenerate functionality, which has limitations on available features. Additionally, this model has not undergone instruction tuning. Based on the Llama architecture. Specifically, it adopts the following modules: - SwiGLU - Rotary Positional Embeddings (RoPE) - Grouped Query Attention (GQA) Also, for predicting 4 tokens (bytes) at once, we have added: - 4 LM heads - A module to merge input embeddings every 4 tokens We used a subset of the Japanese corpus from llm-jp-corpus-v3 (jacc, jawarphtml, jawarppdf, jawiki, kaken). | | tohoku-nlp/bygpt-jp-multi-lm-head-6.5B-alpha | | ---- | ---- | | Training Steps | 170,000 | | Batch Size (tokens) | 5,898,240 | | Max Learning Rate | 5.0E-4 | | Min Learning Rate | 1.0E-5 | | Learning Rate Warmup Steps | 2,000 | | Scheduler | cosine | | Optimizer | AdamW | | Optimizer Config | beta1 = 0.9, beta2 = 0.999, eps = 1.0E-8 | | Weight Decay | 0.01 | | Gradient Clipping | 1.0 | | Sequence Length | 11,520 | For training, we used a codebase based on Megatron-LM with our own custom modifications. This model is distributed under the Apache License 2.0. While the authors of this model have paid careful attention to its content and functionality during creation, we do not guarantee that the model's outputs are accurate or safe, and we assume no responsibility for them. Even if users experience any inconvenience or damage due to the use of this model, the authors of the model and dataset and their affiliated organizations assume no responsibility. We thank all members of the Tohoku NLP Group who cooperated with us in various aspects of training this model. Authors - Keito Kudo - Go Kamoda - Daiki Shiono - Jun Suzuki

NaNK

byllama_patch