Japanese Soseki Gpt2 1b

This repository provides a 1.3B-parameter finetuned Japanese GPT-2 model. The model was finetuned by jweb, based on a model trained by rinna Co., Ltd. Both PyTorch (pytorch_model.bin) and Rust (rust_model.ot) weights are provided.

python

~~~~
import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
# sample output:
# 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。
~~~~

rust

~~~~
use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    let model_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),
    });
    let config_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),
    });
    let vocab_resource = Box::new(RemoteResource {
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),
    });
    let vocab_resource_token = vocab_resource.clone();
    let merges_resource = vocab_resource.clone();
    let generate_config = GenerateConfig {
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    gpt2_model.set_device(Device::cuda_if_available());

    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time (ms): {}", t1.elapsed().as_millis());
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。
~~~~

Model architecture

A 24-layer, 2048-hidden-size transformer-based language model.

Training

The model was trained on Japanese C4, Japanese CC-100, and Japanese Wikipedia to optimize a traditional language-modelling objective. It reaches around 14 perplexity on a validation set drawn from the same data.

Finetuning

The model was finetuned on Aozora Bunko, in particular books by Natsume Soseki.

Tokenization

The model uses a SentencePiece-based tokenizer. The vocabulary was first trained on a selected subset of the training data using the official SentencePiece training script, and then augmented with emojis and symbols.

License

The MIT license
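The "1.3B-parameter" figure is consistent with the stated architecture (24 layers, hidden size 2048) under the standard rough estimate of 12 · L · d² weights for the transformer blocks plus the token-embedding matrix. A back-of-the-envelope sketch (the vocabulary size used here is an assumption for illustration, not taken from this model card):

~~~~
# Rough parameter count for a 24-layer, 2048-hidden GPT-2-style model.
# The vocabulary size is assumed (not stated in the card above).
L, d = 24, 2048
vocab = 44_928  # assumed, for illustration only

block_params = 12 * L * d * d   # standard estimate for attention + MLP weights
embedding_params = vocab * d    # token embedding matrix

total = block_params + embedding_params
print(f"~{total / 1e9:.2f}B parameters")  # → ~1.30B parameters
~~~~

The estimate ignores layer norms and biases, which contribute comparatively few parameters, so landing near 1.3B is the expected result.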
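The "around 14 perplexity" reported for training above is simply the exponential of the model's mean per-token cross-entropy loss (in nats). A minimal illustration of that relationship; the loss value below is made up to match the reported figure, not measured:

~~~~
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# A mean loss of ~2.64 nats per token corresponds to the reported ~14 perplexity.
print(round(perplexity(2.64), 1))  # → 14.0
~~~~

In the Python example above, the same quantity would come from `model(input_ids, labels=input_ids).loss`, whose exponential gives the model's perplexity on that text.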
