# Swedish BERT Models
The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.

The following three models are currently available:

- bert-base-swedish-cased (v1) - A BERT trained with the same hyperparameters as first published by Google.
- bert-base-swedish-cased-ner (experimental) - A BERT fine-tuned for NER using SUC 3.0.
- albert-base-swedish-cased-alpha (alpha) - A first attempt at an ALBERT for Swedish.

All models are cased and trained with whole word masking.

| name                            | files                                          |
|---------------------------------|------------------------------------------------|
| bert-base-swedish-cased         | config, vocab, pytorch_model.bin               |
| bert-base-swedish-cased-ner     | config, vocab, pytorch_model.bin               |
| albert-base-swedish-cased-alpha | config, sentencepiece model, pytorch_model.bin |

The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers < 2.4.0 the tokenizer must be instantiated manually, with the `do_lower_case` flag set to `False` and, for ALBERT, `keep_accents` set to `True`. To create an environment where the examples can be run, see the setup sketch under Example code at the end of this document.

## BERT Base Swedish

A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as shown in the loading sketch under Example code below.

## BERT base fine-tuned for Swedish NER

This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated (see the NER sketch under Example code below). For Transformers < 2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings.

Running the pipeline produces a list of tagged tokens. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.

The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`; for example the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together, so that for example `Engel ##bert` becomes `Engelbert` again, one can use something like the merging sketch under Example code below.

## ALBERT base

The easiest way to load the ALBERT model is, again, using Huggingface Transformers (see the ALBERT sketch under Example code below).

## Acknowledgements

- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was made partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗
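## Example code

A minimal setup sketch for the environment mentioned above, assuming Python 3 on a POSIX-style shell (Windows activates the virtual environment differently); the version pins follow the requirements stated earlier:

```bash
# Create and activate a virtual environment (assumed layout; adjust per OS)
python3 -m venv venv
source venv/bin/activate

# Versions per the requirements noted above
pip install 'transformers>=2.4.1' 'torch>=1.3.1'
```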
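Loading sketch for bert-base-swedish-cased, assuming the models are published under the `KB` namespace on the Huggingface model hub:

```python
from transformers import AutoModel, AutoTokenizer

# Load the cased Swedish BERT base model and its tokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```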
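NER sketch using the `pipeline` helper; the example sentence and the illustrated output are assumptions, and the output shape follows the Transformers 2.4 NER pipeline (a dict per tagged token):

```python
from transformers import pipeline

nlp = pipeline('ner',
               model='KB/bert-base-swedish-cased-ner',
               tokenizer='KB/bert-base-swedish-cased-ner')

print(nlp('Idag släpper KB tre språkmodeller.'))
# Illustrative output shape (scores and entities depend on the model):
# [{'word': 'Idag', 'score': 0.99, 'entity': 'TME'},
#  {'word': 'KB',   'score': 0.99, 'entity': 'ORG'}]
```

For Transformers < 2.4.1 the tokenizer can instead be instantiated manually, e.g. `AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased-ner', do_lower_case=False)`, and passed to `pipeline` via the `tokenizer` argument so that input strings are not lower-cased.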
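Merging sketch for gluing `##` sub-tokens back together; the post-processing loop is an assumed implementation of the rule described above (strip the leading `##` and append to the preceding token):

```python
from transformers import pipeline

nlp = pipeline('ner',
               model='KB/bert-base-swedish-cased-ner',
               tokenizer='KB/bert-base-swedish-cased-ner')

text = 'Engelbert kör Volvo till Herrängens fotbollsklubb'

# Glue '##'-prefixed sub-tokens back onto the preceding token.
# Note: the NER pipeline only returns tokens tagged as entities, so this
# assumes the first sub-token of each tagged word is among the results.
merged = []
for token in nlp(text):
    if token['word'].startswith('##'):
        merged[-1]['word'] += token['word'][2:]
    else:
        merged.append(token)

print(merged)
```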
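ALBERT sketch; loading works the same way as for BERT, and `keep_accents=True` reflects the version note above for manually instantiated ALBERT tokenizers:

```python
from transformers import AutoModel, AutoTokenizer

# keep_accents=True is the flag the version note above calls for when
# instantiating the ALBERT tokenizer manually on Transformers < 2.4.0
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha',
                                    keep_accents=True)
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```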