kz-transformers

2 models

Kaz-RoBERTa Conversational

[DOI: 10.36227/techrxiv.175942902.25827042/v1](https://doi.org/10.36227/techrxiv.175942902.25827042/v1)

Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective. You can use this model directly with a pipeline for masked language modeling; see the usage sketch after the citation below.

Training data

The Kaz-RoBERTa model was pretrained on the union of 2 datasets:

- MDBKD (Multi-Domain Bilingual Kazakh Dataset): a Kazakh-language dataset containing just over 24,883,808 unique texts from multiple domains.
- Conversational data: preprocessed dialogs between the Customer Support Team and clients of Beeline KZ (Veon Group).

Together these datasets weigh 25 GB of text.

Training procedure

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with `<s>` and the end of one by `</s>`. The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The MLM probability was 15%, with `num_attention_heads=12` and `num_hidden_layers=6`.

Citation

If you use Kaz-RoBERTa Conversational, please cite:

Beksultan Sagyndyk, Sanzhar Murzakhmetov, Kirill Yakunin. Kaz-RoBERTa Conversational Technical Report. TechRxiv. October 02, 2025. DOI: 10.36227/techrxiv.175942902.25827042/v1

BibTeX:

```bibtex
@misc{Sagyndyk2025KazRobertaConversational,
  title     = {Kaz-RoBERTa Conversational Technical Report},
  author    = {Beksultan Sagyndyk and Sanzhar Murzakhmetov and Kirill Yakunin},
  year      = {2025},
  publisher = {TechRxiv},
  doi       = {10.36227/techrxiv.175942902.25827042/v1},
  url       = {https://doi.org/10.36227/techrxiv.175942902.25827042/v1}
}
```
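As referenced above, a minimal fill-mask sketch using the `transformers` pipeline. The Hub repo id `kz-transformers/kaz-roberta-conversational` and the Kazakh example sentence are assumptions, not confirmed by the card; RoBERTa-style models use `<mask>` as the mask token.

```python
from transformers import pipeline

# Assumed Hub repo id; replace with the actual checkpoint name if it differs.
fill_mask = pipeline("fill-mask", model="kz-transformers/kaz-roberta-conversational")

# Hypothetical example sentence: "The capital of Kazakhstan is <mask>."
for pred in fill_mask("Қазақстанның астанасы — <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```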
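For readers who want to see the architecture described in the training procedure as code, a hedged sketch of the corresponding `RobertaConfig`. Only the numbers stated in the card are set explicitly; everything else (hidden size, intermediate size, and so on) is left at library defaults, which is an assumption and may not match the released checkpoint.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Values stated in the card; remaining fields stay at library defaults.
config = RobertaConfig(
    vocab_size=52_000,            # BPE vocabulary size from the card
    max_position_embeddings=514,  # 512 tokens + 2 special positions (RoBERTa convention)
    num_hidden_layers=6,          # from the card
    num_attention_heads=12,       # from the card
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```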

license:apache-2.0

horde-vision

license:apache-2.0