sbintuitions

28 models

modernbert-ja-30m

license:mit
74,355
5

modernbert-ja-310m

license:mit
21,236
17

sarashina2.2-3b-instruct-v0.1

llama
18,850
26

sarashina2-vision-8b

llama
9,800
8

sarashina-embedding-v1-1b

llama
7,047
37

modernbert-ja-130m

This repository provides a Japanese ModernBERT trained by SB Intuitions. ModernBERT is a new variant of the BERT model that combines local and global attention, allowing it to handle long sequences while maintaining high computational efficiency. It also incorporates modern architectural improvements, such as RoPE.

Our ModernBERT-Ja-130M is trained on a high-quality corpus of Japanese and English text comprising 4.39T tokens, featuring a vocabulary size of 102,400 and a sequence length of 8,192 tokens.

You can use our models directly with the transformers library v4.48.0 or higher; a usage sketch is given at the end of this overview. Additionally, if your GPUs support Flash Attention 2, we recommend using our models with it.

We provide ModernBERT-Ja in several model sizes. Below is a summary of each model.

|ID| #Param. | #Param. w/o Emb.|Dim.|Inter. Dim.|#Layers|
|-|-|-|-|-|-|
|sbintuitions/modernbert-ja-30m|37M|10M|256|1024|10|
|sbintuitions/modernbert-ja-70m|70M|31M|384|1536|13|
|sbintuitions/modernbert-ja-130m|132M|80M|512|2048|19|
|sbintuitions/modernbert-ja-310m|315M|236M|768|3072|25|

For all models, the vocabulary size is 102,400, the head dimension is 64, and the activation function is GELU. The attention pattern alternates global and sliding-window (local) attention in a 1 + 2 layer configuration (global–local–local). The sliding-window context size is 128, with `global_rope_theta` set to 160,000 and `local_rope_theta` set to 10,000.

We constructed the ModernBERT-Ja-130M model through a three-stage training process that follows the original ModernBERT: first, pre-training on a large corpus; then, two phases of context length extension.

1. Pre-training
   - Training with 3.51T tokens, including Japanese and English data extracted from web corpora.
   - The sequence length is 1,024 with naive sequence packing.
   - Masking rate is 30% (with the 80-10-10 rule).
2. Context Extension (CE): Phase 1
   - Training with 430B tokens, comprising high-quality Japanese and English data.
   - The sequence length is 8,192 with best-fit packing.
   - Masking rate is 30% (with the 80-10-10 rule).
3. Context Extension (CE): Phase 2
   - Training with 450B tokens, comprising high-quality Japanese data.
   - The data consists of 150B tokens, trained for 3 epochs, because overall performance on Japanese tasks improved with 3 epochs compared to just 1 epoch.
   - The sequence length is 8,192 without sequence packing.
   - Masking rate is 15% (with the 80-10-10 rule).

The key differences from the original ModernBERT are:

1. It is pre-trained on Japanese and English corpora, leading to a total of approximately 4.39T training tokens.
2. We observed that decreasing the mask rate in Context Extension Phase 2 from 30% to 15% improved the model's performance.
3. Our model has a large vocabulary size of 102,400, larger than that of most existing Japanese models. To align the number of parameters with existing models, we set the hidden size to 512 and the number of hidden layers to 19. As a result, the model has 52M parameters in the embedding layer, 80M parameters in the Transformer layers, and 132M parameters in total.

We use the tokenizer and vocabulary from sbintuitions/sarashina2-13b: a SentencePiece tokenizer with a unigram language model and byte fallback. We do not apply pre-tokenization with a Japanese tokenizer, so users can feed raw sentences directly into the tokenizer without any additional preprocessing.
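As noted above, the model works with the transformers library v4.48.0 or higher. The snippet below is a minimal fill-mask sketch, not the exact example from the model card; the input sentence is illustrative, and the mask token is taken from the tokenizer rather than hard-coded.

```python
# Minimal fill-mask sketch (assumes transformers>=4.48.0 and access to the Hub).
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "sbintuitions/modernbert-ja-130m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# If your GPUs support Flash Attention 2 (and flash-attn is installed), the card
# recommends enabling it, e.g.:
# model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"今日の天気は{tokenizer.mask_token}です。"  # "The weather today is [MASK]."
for prediction in fill_mask(text):
    print(prediction["token_str"], prediction["score"])
```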
You can use this model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. Note that this model is not designed for text generation; when you want to generate text, please use a text generation model such as Sarashina. Since a unigram language model is used as the tokenizer, token boundaries often do not align with morpheme boundaries, resulting in poor performance on token classification tasks such as named entity recognition and span extraction.

We evaluated our model on 12 datasets, including JGLUE, across various tasks:

- Knowledge-based tasks: JCommonsenseQA (JComQA), RCQA
- Japanese linguistic acceptability classification: JCoLA
- Natural Language Inference (NLI) tasks: JNLI, JSICK, JSNLI, Kyoto University RTE (KU RTE)
- Semantic Textual Similarity (STS) task: JSTS
- Various classification tasks: Livedoor news corpus (Livedoor), LLM-jp Toxicity (Toxicity), MARC-ja, WRIME v2 (WRIME)

These are short-sequence evaluation tasks, and we aligned our settings with those of existing models. While the maximum sequence length varies across tasks, it does not exceed 512. We set the sequence length and other experimental configurations per task, ensuring that the settings remain consistent across models.

For hyperparameters, we explored the following ranges:

- Learning rate: `{5e-6, 1e-5, 2e-5, 3e-5, 5e-5, 1e-4}`
- Number of epochs:
  - Tasks with a large number of instances: `{1, 2}`
  - Tasks with fewer instances: `{3, 5, 10}`

In the experiments, we loaded several Japanese models that are publicly available on HuggingFace using `AutoModel` and constructed classification models by appending a classification head consisting of a linear layer, a GELU activation function, and another linear layer (see the sketch below). This was done because HuggingFace's `AutoModelForSequenceClassification` comes with different implementations for each model, and using them directly would result in classification heads that differ from one model to another. For the embeddings fed into the classification layer, we used the embedding of the special token at the beginning of the sentence: `[CLS]` in BERT and `<s>` in RoBERTa. Note that our model does not perform the next sentence prediction (NSP) task during pretraining, so `<s>` is added at the beginning of the sentence instead of `[CLS]`; we therefore used the `<s>` token for classification.

We conducted evaluations using 5-fold cross-validation. That is, we trained the model on the `train` set and evaluated it on the `validation` set. After determining the optimal hyperparameters (learning rate, epochs) based on the average performance on the `validation` sets, we report the average performance on the `test` sets with those hyperparameters. For datasets without predefined splits, we first set aside 10% of the data as the test set and then performed 5-fold cross-validation on the remaining data. For datasets such as some tasks in JGLUE, where only `train` and `validation` sets are publicly available, we treated the `validation` set as the `test` set and performed 5-fold cross-validation on the remaining data. For datasets with predefined `train`, `validation`, and `test` sets, we simply trained and evaluated the model five times with different random seeds and used the model with the best average evaluation score on the `validation` set to measure the final score on the `test` set.
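A minimal sketch of the classification setup described above: a linear–GELU–linear head on top of `AutoModel`, fed with the embedding of the sentence-initial special token. The class name, label count, and example sentence are illustrative, not the authors' actual evaluation code.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ClassificationModel(nn.Module):
    """Backbone loaded via AutoModel + a linear -> GELU -> linear classification head."""

    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Embedding of the special token at the beginning of the sentence
        # (the [CLS]/<s> token, depending on the model).
        first_token = outputs.last_hidden_state[:, 0]
        return self.head(first_token)

model_name = "sbintuitions/modernbert-ja-130m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ClassificationModel(model_name, num_labels=3)  # e.g. a 3-class NLI task
batch = tokenizer(["この映画はとても面白かった。"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, 3)
```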
| Model | #Param. | #Param. w/o Emb. | Avg. | JComQA (Acc.) | RCQA (Acc.) | JCoLA (Acc.) | JNLI (Acc.) | JSICK (Acc.) | JSNLI (Acc.) | KU RTE (Acc.) | JSTS (Spearman's ρ) | Livedoor (Acc.) | Toxicity (Acc.) | MARC-ja (Acc.) | WRIME (Acc.) |
| ------ | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| ModernBERT-Ja-30M | 37M | 10M | 85.67 | 80.95 | 82.35 | 78.85 | 88.69 | 84.39 | 91.79 | 61.13 | 85.94 | 97.20 | 89.33 | 95.87 | 91.61 |
| ModernBERT-Ja-70M | 70M | 31M | 86.77 | 85.65 | 83.51 | 80.26 | 90.33 | 85.01 | 92.73 | 60.08 | 87.59 | 96.34 | 91.01 | 96.13 | 92.59 |
| ModernBERT-Ja-130M (this model) | 132M | 80M | 88.95 | 91.01 | 85.28 | 84.18 | 92.03 | 86.61 | 94.01 | 65.56 | 89.20 | 97.42 | 91.57 | 96.48 | 93.99 |
| ModernBERT-Ja-310M | 315M | 236M | 89.83 | 93.53 | 86.18 | 84.81 | 92.93 | 86.87 | 94.48 | 68.79 | 90.53 | 96.99 | 91.24 | 96.39 | 95.23 |
| | | | | | | | | | | | | | | | |
| LINE DistillBERT | 68M | 43M | 85.32 | 76.39 | 82.17 | 81.04 | 87.49 | 83.66 | 91.42 | 60.24 | 84.57 | 97.26 | 91.46 | 95.91 | 92.16 |
| Tohoku BERT-base v3 | 111M | 86M | 86.74 | 82.82 | 83.65 | 81.50 | 89.68 | 84.96 | 92.32 | 60.56 | 87.31 | 96.91 | 93.15 | 96.13 | 91.91 |
| LUKE-japanese-base-lite | 133M | 107M | 87.15 | 82.95 | 83.53 | 82.39 | 90.36 | 85.26 | 92.78 | 60.89 | 86.68 | 97.12 | 93.48 | 96.30 | 94.05 |
| Kyoto DeBERTa-v3 | 160M | 86M | 88.31 | 87.44 | 84.90 | 84.35 | 91.91 | 86.22 | 93.41 | 63.31 | 88.51 | 97.10 | 92.58 | 96.32 | 93.64 |
| | | | | | | | | | | | | | | | |
| KoichiYasuoka/modernbert-base-japanese-wikipedia | 160M | 110M | 82.41 | 62.59 | 81.19 | 76.80 | 84.11 | 82.01 | 90.51 | 60.48 | 81.74 | 97.10 | 90.34 | 94.85 | 87.25 |
| llm-jp/llm-jp-modernbert-base | 187M | 110M | 86.75 | 84.29 | 83.99 | 78.00 | 90.28 | 83.76 | 93.40 | 60.32 | 87.71 | 96.64 | 92.13 | 96.33 | 94.09 |
| | | | | | | | | | | | | | | | |
| Tohoku BERT-large char v2 | 311M | 303M | 87.23 | 85.08 | 84.20 | 81.79 | 90.55 | 85.25 | 92.63 | 61.29 | 87.64 | 96.55 | 93.26 | 96.25 | 92.29 |
| Tohoku BERT-large v2 | 337M | 303M | 88.36 | 86.93 | 84.81 | 82.89 | 92.05 | 85.33 | 93.32 | 64.60 | 89.11 | 97.64 | 94.38 | 96.46 | 92.77 |
| Waseda RoBERTa-large (Seq. 512) | 337M | 303M | 88.37 | 88.81 | 84.50 | 82.34 | 91.37 | 85.49 | 93.97 | 61.53 | 88.95 | 96.99 | 95.06 | 96.38 | 95.09 |
| Waseda RoBERTa-large (Seq. 128) | 337M | 303M | 88.36 | 89.35 | 83.63 | 84.26 | 91.53 | 85.30 | 94.05 | 62.82 | 88.67 | 95.82 | 93.60 | 96.05 | 95.23 |
| LUKE-japanese-large-lite | 414M | 379M | 88.94 | 88.01 | 84.84 | 84.34 | 92.37 | 86.14 | 94.32 | 64.68 | 89.30 | 97.53 | 93.71 | 96.49 | 95.59 |
| RetrievaBERT | 1.30B | 1.15B | 86.79 | 80.55 | 84.35 | 80.67 | 89.86 | 85.24 | 93.46 | 60.48 | 87.30 | 97.04 | 92.70 | 96.18 | 93.61 |
| | | | | | | | | | | | | | | | |
| hotchpotch/mMiniLMv2-L6-H384 | 107M | 11M | 81.53 | 60.34 | 82.83 | 78.61 | 86.24 | 77.94 | 87.32 | 60.48 | 80.48 | 95.55 | 86.40 | 94.97 | 87.20 |
| hotchpotch/mMiniLMv2-L12-H384 | 118M | 21M | 82.59 | 62.70 | 83.77 | 78.61 | 87.69 | 79.58 | 87.65 | 60.48 | 81.55 | 95.88 | 90.00 | 94.89 | 88.28 |
| mBERT | 178M | 86M | 83.48 | 66.08 | 82.76 | 77.32 | 88.15 | 84.20 | 91.25 | 60.56 | 84.18 | 97.01 | 89.21 | 95.05 | 85.99 |
| XLM-RoBERTa-base | 278M | 86M | 84.36 | 69.44 | 82.86 | 78.71 | 88.14 | 83.17 | 91.27 | 60.48 | 83.34 | 95.93 | 91.91 | 95.82 | 91.20 |
| XLM-RoBERTa-large | 560M | 303M | 86.95 | 80.07 | 84.47 | 80.42 | 92.16 | 84.74 | 93.87 | 60.48 | 88.03 | 97.01 | 93.37 | 96.03 | 92.72 |

The evaluation results are shown in the table. `#Param.` represents the number of parameters in both the input embedding layer and the Transformer layers, while `#Param. w/o Emb.` indicates the number of parameters in the Transformer layers only.

Our ModernBERT-Ja-130M, a base-sized model, outperformed Tohoku BERT-large and achieved performance comparable to LUKE-japanese-large-lite. In particular, it demonstrated impressive results on knowledge-based tasks such as JCommonsenseQA and RCQA. Despite being a long-context model capable of processing sequences of up to 8,192 tokens, ModernBERT-Ja-130M also exhibited strong performance in short-sequence evaluations.

ModernBERT-Ja-130M may produce representations that reflect biases. When used for masked language modeling, it may generate biased or harmful expressions.

license:mit
4,345
45

sarashina2-7b

llama
3,179
27

sarashina2.2-vision-3b

license:mit
2,759
14

tiny-lm

llama
1,910
2

sarashina-embedding-v2-1b

"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "Sarashina2.2-1B". We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score across 28 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).(Benchmarked on July 28, 2025. ) This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications. - Model Type: Sentence Transformer - Base model: Sarashina2.2-1B - Maximum Sequence Length: 8,192 tokens - Output Dimensionality: 1,792 dimensions - Similarity Function: Cosine Similarity - Language: Japanese - License: Sarashina Model NonCommercial License Agreement For both the query and document sides, use different prefix formats. On the query side, add the prefix `task:` followed by instructions. (Only for STS task, both sentences are considered as query, and should be prefixed with the same instruction.) The table below provides instruction and prefix templates for five main tasks. |Task|Query Side|Document Side| |:-:|:-|:-| |Retrieval Reranking|task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: |text: | |Clustering|task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - | |Classification|task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - | |STS|task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: | Sarashina-Embedding-v2-1B is created through the following three-stage learning process: Stage 1: Weakly-supervised Learning To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets. Step2: Supervised Fine-tuning To further train the model to better understand the similarity between queries and documents, we performed fine-tuning using higher-quality data than that used in Stage 1. Additionally, we trained multiple models by modifying parts of the data. Stage 3: Model Merging To enhance performance, we merged the weights of the two models that yielded the highest JMTEB scores in Stage 2 through linear merging. |Model|Avg.|Retrieval|STS|Classification|Reranking|Clustering| |:-:|:-:|:-:|:-:|:-:|:-:|:-:| |Sarashina-Embedding-v2-1B (This model)|76.38|76.48|84.22|77.14|86.28|52.56| |cl-nagoya/ruri-v3-310m|75.85|76.03|81.59|77.65|85.84|50.52| |sbintuitions/sarashina-embedding-v1-1b|74.87|74.53|81.71|77.20|84.36|50.30| |OpenAI/text-embedding-3-large|73.86|71.95|82.52|77.27|83.06|51.82| This model is licensed under Sarashina Model NonCommercial License Agreement. If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.

llama
1,360
14

sarashina2.2-0.5b-instruct-v0.1

llama
1,286
13

sarashina2.2-1b-instruct-v0.1

llama
1,164
11

sarashina2.2-0.5b

llama
940
10

sarashina2.2-3b

llama
917
16

sarashina2-70b

llama
797
16

modernbert-ja-70m

license:mit
619
6

tiny-lm-chat

llama
614
2

sarashina2-13b

llama
423
17

sarashina2.2-1b

llama
411
10

sarashina1-7b

license:mit
310
0

sarashina1-13b

license:mit
259
0

sarashina1-65b

license:mit
227
5

sarashina2-vision-14b

llama
203
9

sarashina2.1-1b

llama
78
20

sarashina2.2-ocr

license:mit
14
11

nest-ja-0.1b

license:mit
9
3

sarashina2-8x70b

3
32

nest-ja-0.6b

license:mit
2
4