japanese-cloob-vit-b-16
---
language: ja
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: apache-2.0
tags:
- feature-extraction
- clip
- cloob
- vision
inference: false
---
japanese-gpt-neox-small
---
language: ja
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
tags:
- gpt-neox
- text-generation
- lm
- nlp
license: mit
datasets:
- cc100
- Wikipedia
- mc4
inference: false
---
japanese-hubert-base
---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
license: apache-2.0
datasets: reazon-research/reazonspeech
inference: false
tags:
- hubert
- speech
---
japanese-clip-vit-b-16
This is a Japanese CLIP (Contrastive Language-Image Pre-Training) model trained by rinna Co., Ltd. Please see japanese-clip for the other available models.

Model architecture
The model uses a ViT-B/16 Transformer architecture as its image encoder and a 12-layer BERT as its text encoder. The image encoder was initialized from the AugReg `vit-base-patch16-224` model.

Training
The model was trained on CC12M with the captions translated into Japanese.
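CLIP-style models score image-text pairs by cosine similarity between the image encoder's and text encoder's embeddings; the caption with the highest similarity is the predicted match. A minimal sketch of that scoring step with made-up vectors (the `cosine_similarity` helper and the mock 3-d embeddings are illustrative only, not the model's actual outputs, which are higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Mock embeddings standing in for the ViT-B/16 image encoder and
# the 12-layer BERT text encoder outputs.
image_embedding = [0.2, 0.8, 0.1]
text_embeddings = {
    "犬の写真": [0.1, 0.3, 0.9],
    "猫の写真": [0.2, 0.7, 0.2],
}

# Pick the caption whose embedding is most similar to the image's.
best = max(text_embeddings,
           key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]))
print(best)  # 猫の写真
```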
japanese-gpt-neox-3.6b-instruction-sft
japanese-gpt2-small
This repository provides a small-sized Japanese GPT-2 model. The model was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd.

~~~~
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-small", use_fast=False)
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-small")
~~~~

Model architecture
A 12-layer, 768-hidden-size transformer-based language model.

Training
The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a traditional language modelling objective on 8 V100 GPUs for around 15 days. It reaches around 21 perplexity on a chosen validation set from CC-100.

Tokenization
The model uses a sentencepiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.
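Perplexity, the metric reported above, is the exponential of the average per-token negative log-likelihood; a model that spreads probability uniformly over N candidates has perplexity N. A minimal worked illustration (the probabilities are made up):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is always uniformly uncertain over 21 candidates has
# perplexity 21 -- the intuition behind "around 21 perplexity" above.
print(perplexity([1 / 21] * 5))  # 21.0 (up to floating-point error)
```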
japanese-roberta-base
This repository provides a base-sized Japanese RoBERTa model. The model was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd.

~~~~
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-roberta-base", use_fast=False)
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

model = AutoModelForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
~~~~

To predict a masked token, be sure to add a `[CLS]` token before the sentence for the model to correctly encode it, as it is used during model training.

A) Directly typing `[MASK]` in an input string and B) replacing a token with `[MASK]` after tokenization yield different token sequences, and thus different prediction results. It is more appropriate to use `[MASK]` after tokenization (as this is consistent with how the model was pretrained). However, the Huggingface Inference API only supports typing `[MASK]` in the input string and therefore produces less robust predictions.

Note 3: Provide `position_ids` as an argument explicitly
When `position_ids` are not provided for a `Roberta` model, Huggingface's `transformers` will automatically construct them, but starting from `padding_idx` instead of `0` (see the issue and the function `create_position_ids_from_input_ids()` in Huggingface's implementation). This unfortunately does not work as expected with `rinna/japanese-roberta-base`, since the `padding_idx` of the corresponding tokenizer is not `0`. So please be sure to construct the `position_ids` yourself and make them start from position id `0`.

Here is an example to illustrate how our model works as a masked language model. Notice the difference between running the following code example and running the Huggingface Inference API.
~~~~
# original text
text = "4年に1度オリンピックは開かれる。"
# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
import torch
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""
~~~~

Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.

Training
The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective on 8 V100 GPUs for around 15 days. It reaches ~3.9 perplexity on a dev set sampled from CC-100.

Tokenization
The model uses a sentencepiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.
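To see why providing `position_ids` explicitly matters, here is a simplified pure-Python re-implementation of the default position-id construction in `transformers`' `create_position_ids_from_input_ids()` (plain lists instead of tensors, for illustration; `padding_idx=3` below is an arbitrary example value, not necessarily this tokenizer's pad id). Automatic positions start at `padding_idx + 1` rather than `0`, which misaligns with how this model was trained:

```python
def default_position_ids(input_ids, padding_idx):
    """Mimic transformers' automatic position-id construction:
    non-pad tokens are numbered padding_idx + 1, padding_idx + 2, ...,
    while pad tokens are assigned padding_idx itself."""
    position_ids = []
    count = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            count += 1
            position_ids.append(count + padding_idx)
    return position_ids

# With a non-zero padding_idx, the automatic ids do NOT start from 0,
# hence the need to pass position_ids explicitly for this model.
print(default_position_ids([4, 1602, 44], padding_idx=3))  # [4, 5, 6]
print(list(range(3)))                                      # [0, 1, 2] <- what the model expects
```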
japanese-gpt-neox-3.6b
japanese-gpt2-medium
This repository provides a medium-sized Japanese GPT-2 model. The model was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd.

~~~~
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")
~~~~

Model architecture
A 24-layer, 1024-hidden-size transformer-based language model.

Training
The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a traditional language modelling objective on 8 V100 GPUs for around 30 days. It reaches around 18 perplexity on a chosen validation set from the same data.

Tokenization
The model uses a sentencepiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official sentencepiece training script.

Release date
April 7, 2021 (Updated: August 25, 2021)
japanese-wav2vec2-base
japanese-gpt-neox-3.6b-instruction-ppo
Overview
This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model is based on `rinna/japanese-gpt-neox-3.6b-instruction-sft-v2` and has been aligned to serve as an instruction-following conversational agent.

A 36-layer, 2816-hidden-size transformer-based language model. Following the OpenAI InstructGPT paper, Reinforcement Learning from Human Feedback (RLHF) has been applied to align the model's behaviour with input instructions. In particular, the model has been trained in two stages, i.e. Supervised Fine-Tuning (SFT) and PPO-based Reinforcement Learning (RL). The first SFT stage produces `rinna/japanese-gpt-neox-3.6b-instruction-sft-v2`. The second RL stage produces this model.

We conducted human evaluation and ChatGPT-based automated evaluation on 100 prompts to assess the performance gain from reinforcement learning.

| PPO vs. SFT | win | tie | loss |
| :---: | :---: | :---: | :---: |
| Human evaluation | 47% | 30% | 23% |
| ChatGPT auto. evaluation | 63% | 3% | 34% |

We used CarperAI/trlx and its implementation of the PPO algorithm for the RL stage. The RL data is a subset of the following dataset, translated into Japanese: Anthropic HH RLHF data.

| Variant | Link |
| :-- | :-- |
| 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
| 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
| 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
| 3.6B pretrained | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |

Limitations
We found that this version of the PPO model tends to generate repeated text more often than its SFT counterpart, and thus we set `repetition_penalty=1.1` for better generation performance. (The same generation hyper-parameters are applied to the SFT model in the aforementioned evaluation experiments.)
You can also explore other hyper-parameter combinations that yield higher generation randomness/diversity for better generation quality, e.g. `temperature=0.9, repetition_penalty=1.0`.

A special format has been adopted to construct inputs. An input prompt is formatted as a conversation between `ユーザー` and `システム`. Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a whitespace (`" "`), and (4) utterance text (e.g. `"世界で一番高い山は?"`). The input prompt should end with `"システム: "` to signal the model to generate a response. Since the model's tokenizer does not recognize `"\n"`, a special newline symbol `"<NL>"` is used instead. All the newlines in input and output utterances should be replaced with `"<NL>"`, and all the utterances in the input prompt should be separated by `"<NL>"`. The following is an example of constructing an input from a conversation.

~~~python
prompt = [
    {"speaker": "ユーザー", "text": "コンタクトレンズを慣れるにはどうすればよいですか?"},
    {"speaker": "システム", "text": "これについて具体的に説明していただけますか?何が難しいのでしょうか?"},
    {"speaker": "ユーザー", "text": "目が痛いのです。"},
    {"speaker": "システム", "text": "分かりました、コンタクトレンズをつけると目がかゆくなるということですね。思った以上にレンズを外す必要があるでしょうか?"},
    {"speaker": "ユーザー", "text": "いえ、レンズは外しませんが、目が赤くなるんです。"}
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "<NL>".join(prompt)
prompt = (
    prompt
    + "<NL>"
    + "システム: "
)
print(prompt)
# "ユーザー: コンタクトレンズを慣れるにはどうすればよいですか?<NL>システム: これについて具体的に説明していただけますか?何が難しいのでしょうか?<NL>ユーザー: 目が痛いのです。<NL>システム: 分かりました、コンタクトレンズをつけると目がかゆくなるということですね。思った以上にレンズを外す必要があるでしょうか?<NL>ユーザー: いえ、レンズは外しませんが、目が赤くなるんです。<NL>システム: "
~~~

How to use the model

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-ppo", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-ppo")

if torch.cuda.is_available():
    model = model.to("cuda")

token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,
        temperature=0.7,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
output = output.replace("<NL>", "\n")
print(output)
"""それは、コンタクトレンズが目に合わないために起こることがあります。レンズが目の表面に長時間触れ続けることが原因となることがあります。また、コンタクトレンズが汚れている可能性もあります。コンタクトレンズケースを定期的に洗浄したり、コンタクトレンズを正しくフィットさせるようにしたりすることが役立ちます。"""
~~~~

Tokenization
The model uses a sentencepiece-based tokenizer. The tokenizer has a vocabulary size of 32,000. It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing `[UNK]` tokens.

sentencepiece's `--add_dummy_prefix` option was turned off so that a leading whitespace is not prepended automatically.

~~~
print(tokenizer.tokenize("吾輩は猫である"))
# ['吾', '輩', 'は', '猫', 'である']
# instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b
~~~

sentencepiece's `--remove_extra_whitespaces` option was turned off so that leading, trailing, and duplicate whitespaces are preserved.

~~~
print(tokenizer.tokenize("  吾輩は  猫である   "))
# ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
# instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b
~~~

Don't forget to set `use_fast=False` to make the above features function correctly.
~~~
good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")

print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარჯობა  吾輩は  猫である   '
print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარ[UNK]ობა 吾輩は 猫である '
~~~
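The prompt-construction steps described above can be wrapped in a small reusable helper. A minimal pure-Python sketch (the `build_prompt` name and `newline_symbol` parameter are ours, not part of the model's API):

```python
def build_prompt(conversation, newline_symbol="<NL>"):
    """Format a list of {"speaker": ..., "text": ...} dicts into the
    model's expected prompt: utterances joined by the newline symbol
    and terminated with "システム: " to request a response."""
    turns = [f"{uttr['speaker']}: {uttr['text']}" for uttr in conversation]
    return newline_symbol.join(turns) + newline_symbol + "システム: "

prompt = build_prompt([
    {"speaker": "ユーザー", "text": "世界で一番高い山は?"},
])
print(prompt)  # ユーザー: 世界で一番高い山は?<NL>システム: 
```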
bilingual-gpt-neox-4b
youri-7b
japanese-gpt-1b
This repository provides a 1.3B-parameter Japanese GPT model. The model was trained by rinna Co., Ltd.

~~~~
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-1b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=100,
        min_length=100,
        do_sample=True,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=[[tokenizer.unk_token_id]]
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output: 西田幾多郎は、その主著の「善の研究」などで、人間の内面に自然とその根源があると指摘し、その根源的な性格は、この西田哲学を象徴しているとして、カントの「純粋理性批判」と「判断力批判」を対比して捉えます。それは、「人が理性的存在であるかぎりにおいて、人はその当人に固有な道徳的に自覚された善悪の基準を持っている」とするもので、この理性的な善悪の観念を否定するのがカントの
~~~~

Model architecture
A 24-layer, 2048-hidden-size transformer-based language model.

Training
The model was trained on Japanese C4, Japanese CC-100 and Japanese Wikipedia to optimize a traditional language modelling objective. It reaches around 14 perplexity on a chosen validation set from the same data.

Tokenization
The model uses a sentencepiece-based tokenizer. The vocabulary was first trained on a selected subset of the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
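The `top_k=500, top_p=0.95` arguments in the generation example above restrict sampling to the most probable tokens. A minimal pure-Python sketch of the idea behind `top_p` (nucleus) filtering: keep the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`. (This is an illustration of the concept, not transformers' actual implementation; the probabilities below are made up.)

```python
def top_p_filter(probs, top_p):
    """Return the indices of the smallest set of highest-probability
    entries whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)

# With top_p=0.7, the two most likely tokens (cumulative probability
# 0.75 >= 0.7) survive; the low-probability tail is cut off.
print(top_p_filter([0.5, 0.25, 0.15, 0.1], top_p=0.7))  # [0, 1]
```

Sampling then proceeds over the surviving tokens only (after renormalizing their probabilities), which trades off diversity against the risk of drawing from the unreliable tail.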