# PhoBERT: Pre-trained language models for Vietnamese

Table of contents:
1. Introduction
2. Using PhoBERT with `transformers`
   - Installation
   - Pre-trained models
   - Example usage
3. Using PhoBERT with `fairseq`
4. Notes

## Introduction

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

- Two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

```
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

## Installation

- Install `transformers` with pip: `pip install transformers`, or install `transformers` from source.

Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The process of merging a fast tokenizer for PhoBERT is under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, install `transformers` from the branch referenced in that pull request, and then:

- Install `tokenizers` with pip: `pip3 install tokenizers`

## Pre-trained models

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`vinai/phobert-base` | 135M | base | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-large` | 370M | large | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-base-v2` | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301

## Notes

In case the input texts are `raw`, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before feeding them to PhoBERT. As PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter for PhoBERT-based downstream applications on raw input texts; see the usage and word-segmentation sketches below.

## License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
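A minimal usage sketch with `transformers`. This is a sketch only: it assumes the standard `AutoModel`/`AutoTokenizer` API and an input sentence that has already been word-segmented (syllables of a multi-syllable word joined by underscores):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the base model and its (slow) tokenizer from the Hugging Face Hub.
phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# NOTE: the input text must already be word-segmented, e.g. with
# VnCoreNLP's RDRSegmenter (see the segmentation sketch below).
sentence = "Chúng_tôi là những nghiên_cứu_viên ."  # "We are researchers."

# Encode to token ids and add a batch dimension.
input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    # Returns contextual embeddings (last_hidden_state, pooler_output, ...).
    features = phobert(input_ids)
```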
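A sketch of the recommended pre-processing step, here using the `py_vncorenlp` wrapper around VnCoreNLP (assuming `pip install py_vncorenlp` and a working Java runtime; the wrapper's exact API may differ across versions, so treat this as illustrative):

```python
import py_vncorenlp

# Download the VnCoreNLP jar and model files into a local folder.
py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")

# Load only the word-segmentation ("wseg") annotator, i.e. RDRSegmenter.
rdrsegmenter = py_vncorenlp.VnCoreNLP(
    annotators=["wseg"], save_dir="/absolute/path/to/vncorenlp"
)

text = "Chúng tôi là những nghiên cứu viên."
# Returns a list of word-segmented sentences, ready to feed to PhoBERT,
# e.g. ['Chúng_tôi là những nghiên_cứu_viên .']
sentences = rdrsegmenter.word_segment(text)
print(sentences)
```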
# PhoGPT 4B
We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, using an 8192 context length and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its superior performance compared to previous open-source models.

More details about the general architecture and experimental results of PhoGPT can be found in our technical report. Please CITE our technical report when PhoGPT is used to help produce published results or is incorporated into other software. For further information or requests, please go to PhoGPT's homepage!
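A minimal generation sketch for the chat variant. This assumes the model follows the standard `transformers` causal-LM API with `trust_remote_code=True` (for its custom architecture code) and uses an *assumed* instruction prompt template; please verify both against PhoGPT's homepage before relying on them:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vinai/PhoGPT-4B-Chat"

# trust_remote_code=True is assumed here because the model ships custom code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

# ASSUMED instruction template ("### Question: ... ### Answer:") -- check the
# official PhoGPT homepage for the exact format.
instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"  # "Write an essay on traffic safety"
prompt = f"### Câu hỏi: {instruction}\n### Trả lời:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        top_p=0.9,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```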