# PhoBERT: Pre-trained language models for Vietnamese

Table of contents:
1. Introduction
2. Using PhoBERT with `transformers`
   - Installation
   - Pre-trained models
   - Example usage
3. Using PhoBERT with `fairseq`
4. Notes

## Introduction

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

- Two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

```
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

## Installation

- Install `transformers` with pip: `pip install transformers`, or install `transformers` from source.

Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The process of merging a fast tokenizer for PhoBERT is under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, install `transformers` from the branch referenced in that pull request, and then:

- Install `tokenizers` with pip: `pip3 install tokenizers`

## Pre-trained models

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`vinai/phobert-base` | 135M | base | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-large` | 370M | large | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-base-v2` | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301

## Notes

In case the input texts are `raw`, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before feeding them to PhoBERT. As PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter for PhoBERT-based downstream applications on raw input texts; see the usage and word-segmentation sketches below.

## License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
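A minimal usage sketch with `transformers`. This is a sketch only: it assumes the standard `AutoModel`/`AutoTokenizer` API and an input sentence that has already been word-segmented (syllables of a multi-syllable word joined by underscores):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the base model and its (slow) tokenizer from the Hugging Face Hub.
phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# NOTE: the input text must already be word-segmented, e.g. with
# VnCoreNLP's RDRSegmenter (see the segmentation sketch below).
sentence = "Chúng_tôi là những nghiên_cứu_viên ."  # "We are researchers."

# Encode to token ids and add a batch dimension.
input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    # Returns contextual embeddings (last_hidden_state, pooler_output, ...).
    features = phobert(input_ids)
```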
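A sketch of the recommended pre-processing step, here using the `py_vncorenlp` wrapper around VnCoreNLP (assuming `pip install py_vncorenlp` and a working Java runtime; the wrapper's exact API may differ across versions, so treat this as illustrative):

```python
import py_vncorenlp

# Download the VnCoreNLP jar and model files into a local folder.
py_vncorenlp.download_model(save_dir="/absolute/path/to/vncorenlp")

# Load only the word-segmentation ("wseg") annotator, i.e. RDRSegmenter.
rdrsegmenter = py_vncorenlp.VnCoreNLP(
    annotators=["wseg"], save_dir="/absolute/path/to/vncorenlp"
)

text = "Chúng tôi là những nghiên cứu viên."
# Returns a list of word-segmented sentences, ready to feed to PhoBERT,
# e.g. ['Chúng_tôi là những nghiên_cứu_viên .']
sentences = rdrsegmenter.word_segment(text)
print(sentences)
```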
# PhoGPT 4B
We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, using an 8192 context length and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its superior performance compared to previous open-source models.

More details about the general architecture and experimental results of PhoGPT can be found in our technical report. Please CITE our technical report when PhoGPT is used to help produce published results or is incorporated into other software. For further information or requests, please go to PhoGPT's homepage!
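A minimal generation sketch for the chat variant. This assumes the model follows the standard `transformers` causal-LM API with `trust_remote_code=True` (for its custom architecture code) and uses an *assumed* instruction prompt template; please verify both against PhoGPT's homepage before relying on them:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vinai/PhoGPT-4B-Chat"

# trust_remote_code=True is assumed here because the model ships custom code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

# ASSUMED instruction template ("### Question: ... ### Answer:") -- check the
# official PhoGPT homepage for the exact format.
instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"  # "Write an essay on traffic safety"
prompt = f"### Câu hỏi: {instruction}\n### Trả lời:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        top_p=0.9,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```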