Voicelab

9 models

herbert-base-cased-sentiment

Overview:
- Language model: allegro/herbert-base-cased
- Language: pl
- Training data: reviews + own data
- Blog post: Sentiment analysis – COVID-19 – the source of the heated discussion
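The excerpt names only the base model and training data. As a minimal usage sketch, assuming the checkpoint ships a standard sequence-classification head (which this listing does not state), it could be called through the `transformers` pipeline API:

```python
# Hedged sketch, not from the original card: assumes
# Voicelab/herbert-base-cased-sentiment exposes a standard
# sequence-classification head on the Hugging Face Hub.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="Voicelab/herbert-base-cased-sentiment",
)

# Polish input: "The product is great, I recommend it."
print(sentiment("Produkt jest świetny, polecam."))
# e.g. [{'label': ..., 'score': ...}] -- label names depend on the model config
```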

license:cc-by-4.0 · 78,858 downloads · 7 likes

vlt5-base-keywords

> Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using the Transformer blocks presented by Google (https://huggingface.co/t5-base). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of an article's abstract and title. It generates precise, yet not always complete, keyphrases that describe the content of the article based only on the abstract.

Keywords generated with vlT5-base-keywords: encoder-decoder architecture, keyword generation

Results on demo model (different generation method, one model per language): encoder-decoder architecture, vlT5, keyword generation, scientific articles corpus

The biggest advantage of vlT5 is its transferability: it works well across all domains and types of text. The downside is that the text length and the number of keywords mirror the training data: a text piece of abstract length generates approximately 3 to 5 keywords. The model works both extractively and abstractively. Longer pieces of text must be split into smaller chunks, each of which is then passed to the model (see the chunking sketch below).

Overview:
- Language model: t5-base
- Language: pl, en (but works relatively well with others)
- Training data: POSMAC
- Online demo: https://nlp-demo-1.voicelab.ai/ (visit for better results)
- Paper: Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer, ACIIDS 2022

The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project.

| Domains | Documents | With keywords |
| --- | ---: | ---: |
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |

As in the original plT5 implementation, the training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
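The card says longer texts must be chunked before inference but gives no exact limit. A minimal sketch of that preprocessing step, assuming an abstract-sized window of roughly 400 words (the window size is an assumption, not a documented property of the model):

```python
# Hedged sketch: split a long document into abstract-sized chunks, as the
# card recommends for texts longer than the training inputs. The 400-word
# window is an assumption, not a documented limit of the model.
def chunk_text(text: str, max_words: int = 400) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk is then passed to the model separately (see the inference
# sketch below) and the per-chunk keywords are merged.
```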
Inference

Our results showed that the best generation results were achieved with `no_repeat_ngram_size=3, num_beams=4`.

| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| vlT5kw | 1 | 0.345 | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| | 5 | 0.318 | 0.237 | 0.271 | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| vlT5kw | 1 | 0.377 | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| | 5 | 0.357 | 0.316 | 0.335 | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |

If you use this model, please cite the following papers:

Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42

Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk. Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer. ACIIDS 2022.

The model was trained by the NLP Research Team at Voicelab.ai.
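A minimal inference sketch tying the pieces together; it assumes the checkpoint follows the standard Hugging Face T5 seq2seq interface, and the `Keywords:` task prefix is an assumption of this sketch rather than something stated in the excerpt above:

```python
# Hedged sketch: keyword generation with the settings the card reports as
# best (no_repeat_ngram_size=3, num_beams=4). The "Keywords: " prefix is an
# assumption; check the model card for the exact prompt format.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Voicelab/vlt5-base-keywords"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Concatenation of the article's abstract and title, as the card describes.
abstract_and_title = (
    "Our vlT5 model is a keyword generation model based on an "
    "encoder-decoder architecture using Transformer blocks."
)
inputs = tokenizer(
    "Keywords: " + abstract_and_title,
    return_tensors="pt",
    truncation=True,
)

outputs = model.generate(**inputs, no_repeat_ngram_size=3, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```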

license:cc-by-4.0 · 11,335 downloads · 55 likes

trurl-2-13b

llama · 692 downloads · 29 likes

trurl-2-13b-academic

llama · 689 downloads · 5 likes

trurl-2-7b

llama · 681 downloads · 16 likes

sbert-large-cased-pl

license:cc-by-4.0 · 203 downloads · 8 likes

sbert-base-cased-pl

license:cc-by-4.0 · 25 downloads · 9 likes

trurl-2-7b-8bit

llama · 4 downloads · 6 likes

trurl-2-13b-8bit

llama · 3 downloads · 10 likes