vinai

28 models

bertweet-base


license:mit
315,220
40

phobert-base


license:mit
287,711
64

phobert-base-v2

## Table of contents

1. Introduction
2. Using PhoBERT with `transformers`
   - Installation
   - Pre-trained models
   - Example usage
3. Using PhoBERT with `fairseq`
4. Notes

# PhoBERT: Pre-trained language models for Vietnamese

Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese (Pho, i.e. "Phở", is a popular food in Vietnam):

- Two PhoBERT versions, "base" and "large", are the first public large-scale monolingual language models pre-trained for Vietnamese. The PhoBERT pre-training approach is based on RoBERTa, which optimizes the BERT pre-training procedure for more robust performance.
- PhoBERT outperforms previous monolingual and multilingual approaches, obtaining new state-of-the-art performance on four downstream Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

The general architecture and experimental results of PhoBERT can be found in our paper:

    @inproceedings{phobert,
      title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
      author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
      booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
      year      = {2020},
      pages     = {1037--1042}
    }

Please CITE our paper when PhoBERT is used to help produce published results or is incorporated into other software.

## Installation

- Install `transformers` with pip: `pip install transformers`, or install `transformers` from source. Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The merging of a fast tokenizer for PhoBERT is still under discussion, as mentioned in this pull request. Users who would like to use the fast tokenizer should also:
- Install `tokenizers` with pip: `pip3 install tokenizers`

## Pre-trained models

Model | #params | Arch. | Max length | Pre-training data
---|---|---|---|---
`vinai/phobert-base` | 135M | base | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-large` | 370M | large | 256 | 20GB of Wikipedia and News texts
`vinai/phobert-base-v2` | 135M | base | 256 | 20GB of Wikipedia and News texts + 120GB of texts from OSCAR-2301

## Notes

If the input texts are raw, i.e. without word segmentation, a word segmenter must be applied to produce word-segmented texts before they are fed to PhoBERT. As PhoBERT employed the RDRSegmenter from VnCoreNLP to pre-process the pre-training data (including Vietnamese tone normalization and word and sentence segmentation), it is recommended to use the same word segmenter on raw input texts in PhoBERT-based downstream applications.

## License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
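The word-segmentation requirement can be illustrated with a small sketch: the segmenter joins the syllables of each multi-syllable Vietnamese word with underscores before the text reaches PhoBERT's tokenizer. The `join_segments` helper below is purely illustrative (it is not part of VnCoreNLP); real applications should obtain the segmentation from the RDRSegmenter.

```python
# Illustration of the word-segmented input format PhoBERT expects.
# NOTE: join_segments is a hypothetical helper for demonstration only;
# actual segmentation should come from VnCoreNLP's RDRSegmenter.
def join_segments(words):
    """Join each multi-syllable word's syllables with underscores."""
    return " ".join(
        "_".join(w) if isinstance(w, (list, tuple)) else w
        for w in words
    )

# Raw input: "Chúng tôi là những nghiên cứu viên ."
# A segmenter groups the syllables into words like this:
segmented = join_segments([("Chúng", "tôi"), "là", "những",
                           ("nghiên", "cứu", "viên"), "."])
print(segmented)  # Chúng_tôi là những nghiên_cứu_viên .
```

The underscore-joined string, not the raw sentence, is what should be passed to the PhoBERT tokenizer, since the pre-training data was segmented the same way.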

license:agpl-3.0
120,628
27

xphonebert-base

license:mit
22,940
17

phobert-large

license:mit
15,090
12

PhoWhisper-medium

license:bsd-3-clause
5,948
15

PhoWhisper-large

license:bsd-3-clause
4,049
34

bartpho-word

license:mit
3,769
6

bertweet-large

license:mit
3,502
13

bartpho-syllable

license:mit
3,341
7

PhoWhisper-base

license:bsd-3-clause
3,084
7

PhoWhisper-small

license:bsd-3-clause
2,340
9

PhoWhisper-tiny

license:bsd-3-clause
1,620
13

bartpho-syllable-base

license:mit
1,426
1

PhoGPT-4B-Chat

license:bsd-3-clause
1,195
42

vinai-translate-vi2en-v2

license:agpl-3.0
1,136
6

PhoGPT-4B

We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat.

The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192-token context length and a vocabulary of 20480 token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its superior performance compared to previous open-source models.

More details about the general architecture and experimental results of PhoGPT can be found in our technical report. Please CITE our technical report when PhoGPT is used to help produce published results or is incorporated into other software. For further information or requests, please visit PhoGPT's homepage!
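The headline figures above can be collected into a small reference dict. This is only a summary of the numbers quoted in the description; the field names are illustrative and are not the keys of the released checkpoints' actual configuration files.

```python
# Key PhoGPT-4B figures from the description above, gathered for reference.
# Field names are illustrative only, not the released config's actual keys.
PHOGPT_4B_SPECS = {
    "parameters": 3_700_000_000,            # exactly 3.7B (the "4B" series)
    "context_length": 8192,                 # tokens
    "vocab_size": 20_480,                   # token types
    "pretraining_tokens": 102_000_000_000,  # 102B-token Vietnamese corpus
    "sft_prompts": 70_000,                  # instructional prompt/response pairs
    "sft_conversations": 290_000,           # additional fine-tuning conversations
}
```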

license:bsd-3-clause
986
19

vinai-translate-en2vi-v2

license:agpl-3.0
639
11

bartpho-word-base

license:mit
402
3

PhoGPT-4B-Chat-gguf

license:bsd-3-clause
370
10

vinai-translate-vi2en

license:agpl-3.0
302
11

vinai-translate-en2vi

license:agpl-3.0
224
4

bertweet-covid19-base-uncased

license:mit
81
2

PhoGPT-7B5

license:bsd-3-clause
68
46

bertweet-covid19-base-cased

license:mit
32
2

RecGPT-7B-Instruct

license:cc-by-nc-4.0
6
7

PhoGPT-7B5-Instruct

license:bsd-3-clause
2
39

RecGPT-7B

license:cc-by-nc-4.0
2
4