# HooshvareLab
## bert-fa-zwnj-base-ner
Language: fa
## bert-fa-base-uncased
## ParsBERT (v2.0): A Transformer-based Model for Persian Language Understanding

We reconstructed the vocabulary and fine-tuned ParsBERT v1.1 on new Persian corpora in order to make ParsBERT usable in other scopes! Please follow the ParsBERT repo for the latest information about previous and current models.

ParsBERT is a monolingual language model based on Google's BERT architecture. It is pre-trained on large Persian corpora covering a variety of writing styles and subjects (e.g., scientific texts, novels, news), with more than `3.9M` documents, `73M` sentences, and `1.3B` words.

You can use the raw model for either masked language modeling or next-sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.

ParsBERT was trained on a massive amount of public corpora (Persian Wikidumps, MirasText) and six manually crawled text datasets from various types of websites (BigBang Page `scientific`, Chetor `lifestyle`, Eligasht `itinerary`, Digikala `digital magazine`, Ted Talks `general conversational`, Books `novels, storybooks, and short stories from the old to the contemporary era`).

As part of the ParsBERT methodology, an extensive pre-processing step combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.

### Goals

Objective goals during training are as below (after 300k steps).
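As a quick illustration of the masked-language-modeling use mentioned above, here is a minimal sketch (it assumes the `transformers` library is installed and that the checkpoint can be downloaded from the Hub; the example sentence is our own, not from the model card):

```python
MODEL_ID = "HooshvareLab/bert-fa-base-uncased"


def top_mask_fills(text: str, k: int = 5):
    """Return the top-k candidate fills for the [MASK] token in `text`."""
    # Deferred import so the sketch can be read without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return [pred["token_str"] for pred in fill_mask(text)[:k]]


if __name__ == "__main__":
    # Persian for "The weather is very [MASK] today."
    print(top_mask_fills("امروز هوا خیلی [MASK] است."))
```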
**ParsBERT v2.0 Model**
- HooshvareLab/bert-fa-base-uncased

**ParsBERT v2.0 Sentiment Analysis**
- HooshvareLab/bert-fa-base-uncased-sentiment-digikala
- HooshvareLab/bert-fa-base-uncased-sentiment-snappfood
- HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary
- HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi

**ParsBERT v2.0 Text Classification**
- HooshvareLab/bert-fa-base-uncased-clf-digimag
- HooshvareLab/bert-fa-base-uncased-clf-persiannews

**ParsBERT v2.0 NER**
- HooshvareLab/bert-fa-base-uncased-ner-peyma
- HooshvareLab/bert-fa-base-uncased-ner-arman

ParsBERT is evaluated on three downstream NLP tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this purpose, and due to insufficient resources, two large datasets for SA and two for text classification were manually composed, which are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models, on all tasks, improving the state of the art in Persian language modeling.
**Sentiment Analysis**

| Dataset                  | ParsBERT v2 | ParsBERT v1 | mBERT | DeepSentiPers |
|:------------------------:|:-----------:|:-----------:|:-----:|:-------------:|
| Digikala User Comments   | 81.72       | 81.74       | 80.74 | -             |
| SnappFood User Comments  | 87.98       | 88.12       | 87.87 | -             |
| SentiPers (Multi Class)  | 71.31       | 71.11       | -     | 69.33         |
| SentiPers (Binary Class) | 92.42       | 92.13       | -     | 91.98         |

**Text Classification**

| Dataset           | ParsBERT v2 | ParsBERT v1 | mBERT |
|:-----------------:|:-----------:|:-----------:|:-----:|
| Digikala Magazine | 93.65       | 93.59       | 90.72 |
| Persian News      | 97.44       | 97.19       | 95.79 |

**Named Entity Recognition**

| Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:-----------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
| PEYMA   | 93.40       | 93.10       | 86.64 | -          | 90.59        | -        | 84.00          | -          |
| ARMAN   | 99.84       | 98.79       | 95.89 | 89.9       | 84.03        | 86.55    | -              | 77.45      |

Questions? Post a Github issue on the ParsBERT Issues repo.
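The fine-tuned NER checkpoints listed above plug into the standard token-classification pipeline; a sketch of loading the PEYMA checkpoint follows (it assumes `transformers` is installed and the model can be fetched from the Hub; the helper name is ours):

```python
NER_MODEL_ID = "HooshvareLab/bert-fa-base-uncased-ner-peyma"


def extract_entities(text: str):
    """Return (word, entity-group) pairs predicted by the PEYMA NER checkpoint."""
    # Deferred import; the first call downloads the checkpoint from the Hub.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=NER_MODEL_ID,
        aggregation_strategy="simple",  # merge sub-word pieces into whole entities
    )
    return [(ent["word"], ent["entity_group"]) for ent in ner(text)]
```

`aggregation_strategy="simple"` groups WordPiece fragments back into surface words, which is usually what you want when reporting entity spans.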
## bert-base-parsbert-uncased
## ParsBERT: Transformer-based Model for Persian Language Understanding

ParsBERT is a monolingual language model based on Google's BERT architecture, with the same configurations as BERT-Base. All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon, stay tuned)

**Sentiment Analysis**

| Dataset                  | ParsBERT | mBERT | DeepSentiPers |
|:------------------------:|:--------:|:-----:|:-------------:|
| Digikala User Comments   | 81.74    | 80.74 | -             |
| SnappFood User Comments  | 88.12    | 87.87 | -             |
| SentiPers (Multi Class)  | 71.11    | -     | 69.33         |
| SentiPers (Binary Class) | 92.13    | -     | 91.98         |

**Text Classification**

| Dataset           | ParsBERT | mBERT |
|:-----------------:|:--------:|:-----:|
| Digikala Magazine | 93.59    | 90.72 |
| Persian News      | 97.19    | 95.79 |

**Named Entity Recognition**

| Dataset | ParsBERT | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:--------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
| PEYMA   | 93.10    | 86.64 | -          | 90.59        | -        | 84.00          | -          |
| ARMAN   | 98.79    | 95.89 | 89.9       | 84.03        | 86.55    | -              | 77.45      |

If you have tested ParsBERT on a public dataset and want to add your results to the tables above, open a pull request or contact us. Also make sure your code is available online so we can add it as a reference.

Please cite the following paper in your publication if you are using ParsBERT in your research.

We hereby express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.

**Team**
- Mehrdad Farahani: Linkedin, Twitter, Github
- Mohammad Gharachorloo: Linkedin, Twitter, Github
- Marzieh Farahani: Linkedin, Twitter, Github
- Mohammad Manthouri: Linkedin, Twitter, Github
- Hooshvare Team: Official Website, Linkedin, Twitter, Github, Instagram

**Release v0.1 (May 27, 2019)**

This is the first version of our ParsBERT, based on BERT-Base.
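To experiment with this base checkpoint directly, the model and tokenizer can be loaded for feature extraction; the snippet below is a sketch, not the card's own recipe (it assumes `transformers` and `torch` are installed, and the mean-pooling step is our own illustrative choice):

```python
BASE_MODEL_ID = "HooshvareLab/bert-base-parsbert-uncased"


def sentence_embedding(text: str):
    """Mean-pool the last hidden states into one fixed-size sentence vector."""
    # Deferred imports; the first call downloads the checkpoint from the Hub.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    model = AutoModel.from_pretrained(BASE_MODEL_ID)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)            # shape (hidden,)
```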
## bert-fa-zwnj-base
Language: fa
License: Apache 2.0
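The `zwnj` in this model's name presumably refers to the zero-width non-joiner (U+200C), a character Persian orthography places inside words such as «می‌روم». The snippet below is a small, purely illustrative helper showing what that character looks like at the string level; it is not part of the model's own tokenizer:

```python
ZWNJ = "\u200c"  # zero-width non-joiner, invisible but significant in Persian


def contains_zwnj(text: str) -> bool:
    """True if the text contains the zero-width non-joiner."""
    return ZWNJ in text


def strip_zwnj(text: str) -> str:
    """Lossy normalization: drop every ZWNJ, so the word halves join up."""
    return text.replace(ZWNJ, "")


# "I go", written with a ZWNJ between the prefix and the stem.
word = "می" + ZWNJ + "روم"
```

A model that is not ZWNJ-aware typically sees the stripped and unstripped spellings as different tokens, which is exactly the mismatch a ZWNJ-trained checkpoint is meant to handle.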