snunlp
KR-SBERT-V40K-klueNLI-augSTS
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
widget:
- source_sentence: "그 식당은 파리를 날린다"
  sentences:
  - "그 식당에는 손님이 없다"
  - "그 식당에서는 드론을 날린다"
  - "파리가 식당에 날아다닌다"
  example_title: "Restaurant"
- source_sentence: "잠이 옵니다"
  sentences:
  - "잠이 안 옵니다"
  - "졸음이 옵니다"
  - "기차가 옵니다"
  example_title: "Sleepy"
---
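The widget entries above pair one source sentence with several candidates, which is the usual sentence-similarity setup: embed every sentence, then rank the candidates by cosine similarity to the source. The following is a minimal sketch of that ranking, not the card's own code. The model id `snunlp/KR-SBERT-V40K-klueNLI-augSTS` is an assumption built from the organization and model names above, and `cosine`, `rank_by_similarity`, `embed`, and `demo` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(source_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(source_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(sentences, model_id="snunlp/KR-SBERT-V40K-klueNLI-augSTS"):
    """Encode sentences with sentence-transformers (downloads the model).

    The model id is an assumption, not confirmed by the card text.
    """
    # Imported lazily so the ranking helpers work without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return [vec.tolist() for vec in model.encode(sentences)]


def demo():
    """Rank the candidates of the "Restaurant" widget example."""
    source = "그 식당은 파리를 날린다"
    candidates = [
        "그 식당에는 손님이 없다",
        "그 식당에서는 드론을 날린다",
        "파리가 식당에 날아다닌다",
    ]
    vectors = embed([source] + candidates)
    for idx, score in rank_by_similarity(vectors[0], vectors[1:]):
        print(f"{score:.3f}  {candidates[idx]}")
```

Calling `demo()` requires the `sentence-transformers` package and network access. The ranking helpers are plain Python and can be reused with embeddings from any encoder.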
KR-ELECTRA-generator
This is a release of a Korean-specific ELECTRA model developed by the Computational Linguistics Lab at Seoul National University, with performance comparable to or better than existing Korean models. Our model performs particularly well on tasks involving informal texts such as review documents, while remaining comparable on other kinds of tasks.

Released Model

We pre-trained our KR-ELECTRA model following the base-scale ELECTRA model, including its training parameters. We trained the model with TensorFlow v1 on a v3-8 TPU on Google Cloud Platform.

| model | # of layers | embedding size | hidden size | # of heads |
| ------: | ----------: | -------------: | ----------: | ---------: |
| Discriminator | 12 | 768 | 768 | 12 |
| Generator | 12 | 768 | 256 | 4 |

| batch size | train steps | learning rates | max sequence length | generator size |
| ---------: | ----------: | -------------: | ------------------: | -------------: |
| 256 | 700000 | 2e-4 | 128 | 0.33333 |

The training data is 34GB of Korean texts, including Wikipedia documents, news articles, legal texts, news comments, product reviews, and so on. These texts are balanced, with equal proportions of written and spoken data.

The vocabulary size is 30,000. We used morpheme-based unit tokens for our vocabulary, based on the Mecab-Ko morpheme analyzer.

For fine-tuning, we used a slightly edited version of the fine-tuning code from KoELECTRA, with additionally adjusted hyperparameters. You can download the code and config files that we used for our model from our GitHub.

| | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
| :-------------------- | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: | :-----------------------------------: |
| KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 66.21 |
| XLM-Roberta-Base | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | 68.32 |
| KoELECTRA-Base | 90.33 | 87.18 | 81.70 | 80.64 | 82.00 | 93.54 | 60.86 / 89.28 | 66.09 |
| KoELECTRA-Base-v2 | 89.56 | 87.16 | 80.70 | 80.72 | 82.30 | 94.85 | 84.01 / 92.40 | 67.45 |
| KoELECTRA-Base-v3 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61 |
| KR-ELECTRA (ours) | 91.168 | 87.90 | 82.05 | 82.51 | 85.41 | 95.51 | 84.93 / 93.04 | 74.50 |
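
The Generator row of the model table (hidden size 256, 4 heads) is consistent with reading the "generator size" of 0.33333 as a fraction of the discriminator's hidden size. The sketch below reproduces that arithmetic under two assumptions that the card does not state: the hidden size is the rounded product, and the per-head dimension stays the same as in the discriminator. The function name `generator_dims` is hypothetical.

```python
def generator_dims(disc_hidden_size, disc_num_heads, generator_size):
    """Derive generator hidden size and head count from the discriminator.

    Assumed rule (not stated in the card): scale the hidden size by
    `generator_size`, then keep the discriminator's per-head dimension.
    """
    head_dim = disc_hidden_size // disc_num_heads  # 768 // 12 = 64
    hidden = int(round(disc_hidden_size * generator_size))
    heads = max(1, hidden // head_dim)
    return hidden, heads


# Values taken from the tables: discriminator 768 hidden / 12 heads,
# generator size 0.33333.
print(generator_dims(768, 12, 0.33333))
```

With the table's values this yields a hidden size of 256 and 4 attention heads, matching the Generator row.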
KR-FinBert-SC
Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adaptation using a small-scale corpus and fine-tuning with labeled data are effective for overall performance improvement. We proposed KR-FinBert for the financial domain by further pre-training on a financial corpus and fine-tuning for sentiment analysis. Consistent with those studies, domain adaptation followed by fine-tuning on the downstream task clearly improved performance in this experiment.

The training data for this model extends that of KR-BERT-MEDIUM: texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center, and the Korean Comments dataset. For transfer learning, we added corporate-related economic news articles from 72 media sources, such as the Financial Times and The Korean Economy Daily, and analyst reports from 16 securities companies, such as Kiwoom Securities and Samsung Securities. The added data includes 440,067 news titles with their content and 11,237 analyst reports. The total data size is about 13.22GB. For MLM training, we split the data line by line, giving a total of 6,379,315 lines.

KR-FinBert is trained for 5.5M steps with a maximum sequence length of 512, a training batch size of 32, and a learning rate of 5e-5. Training took 67.48 hours on an NVIDIA TITAN XP.

Downstream task performance with 50,000 labeled examples:

|Model|Accuracy|
|-|-|
|KR-FinBert|0.963|
|KR-BERT-MEDIUM|0.958|
|KcBert-large|0.955|
|KcBert-base|0.953|
|KoBert|0.817|

|Positive|Negative|
|-|-|
|현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등 | 영화관株 '코로나 빙하기' 언제 끝나나…"CJ CGV 올 4000억 손실 날수도" |
|이수화학, 3분기 영업익 176억…전년比 80%↑ | C쇼크에 멈춘 흑자비행…대한항공 1분기 영업적자 566억 |
|"GKL, 7년 만에 두 자릿수 매출성장 예상" | '1000억대 횡령·배임' 최신원 회장 구속… SK네트웍스 "경영 공백 방지 최선" |
|위지윅스튜디오, 콘텐츠 활약에 사상 첫 매출 1000억원 돌파 | 부품 공급 차질에…기아차 광주공장 전면 가동 중단 |
|삼성전자, 2년 만에 인도 스마트폰 시장 점유율 1위 '왕좌 탈환' | 현대제철, 지난해 영업익 3,313억원···전년比 67.7% 감소 |
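
Headlines like the ones in the table above could be scored with the Hugging Face `transformers` text-classification pipeline. This is a hedged sketch rather than the authors' procedure. The model id `snunlp/KR-FinBert-SC` is an assumption built from the organization and model names, the label strings depend on the model's own config, and `load_classifier` and `classify_headlines` are hypothetical helper names.

```python
def load_classifier(model_id="snunlp/KR-FinBert-SC"):
    """Build a text-classification pipeline (downloads the model).

    The model id is an assumption, not confirmed by the card text.
    """
    # Imported lazily so classify_headlines works with any callable.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_headlines(headlines, classifier):
    """Pair each headline with the predicted label and its score.

    `classifier` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, as the pipeline does.
    """
    results = classifier(headlines)
    return [(text, r["label"], r["score"]) for text, r in zip(headlines, results)]


def demo():
    """Score one positive and one negative headline from the table."""
    headlines = [
        "이수화학, 3분기 영업익 176억…전년比 80%↑",
        "부품 공급 차질에…기아차 광주공장 전면 가동 중단",
    ]
    for text, label, score in classify_headlines(headlines, load_classifier()):
        print(f"{label}\t{score:.3f}\t{text}")
```

Calling `demo()` requires the `transformers` package, a backend such as PyTorch, and network access. Passing the classifier as an argument keeps the pairing logic usable with a stub or a locally loaded model.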