lapa-llm
Lapa 12b Pt
> NOTE: THIS MODEL HAS TO BE USED WITH BOSTOKEN AT THE START, OTHERWISE PERFORMANCE WOULD BE HEAVILY DEGRADED! We proudly present Lapa LLM — a cutting-edge open large language model based on Gemma-3-12B with a focus on Ukrainian language processing. The project is the result of many months of work by a team of Ukrainian researchers in artificial intelligence from the Ukrainian Catholic University, AGH University of Krakow, Igor Sikorsky Kyiv Polytechnic Institute, and Lviv Polytechnic, who united to create the best model for Ukrainian language processing. We proudly present Lapa LLM — a cutting-edge open large language model based on Gemma-3-12B with a focus on Ukrainian language processing. The project is the result of many months of work by a team of Ukrainian researchers in artificial intelligence from the Ukrainian Catholic University, AGH University of Krakow, Igor Sikorsky Kyiv Polytechnic Institute, and Lviv Polytechnic, who united to create the best model for Ukrainian language processing. Thanks to a SOTA method for tokenizer adaptation developed by Mykola Haltiuk as part of this project, it was possible to replace 80,000 tokens out of 250,000 with Ukrainian ones without loss of model quality, thus making Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, for working with Ukrainian, the model requires 1.5 times fewer tokens, thus performing three times fewer computations to achieve better results. Most efficient instruction-tuned model on the market Our instruction version of the model in some benchmark categories is only slightly behind the current leader — MamayLM. The team is actively working on new datasets to further improve benchmark scores, which we plan to surpass in the v1.0 model. - Best English-to-Ukrainian translator with a result of 33 BLEU on FLORES and vice versa, which allows for natural and cost-effective translation of new NLP datasets into Ukrainian - One of the best models for image processing in Ukrainian in its size class, as measured on the MMZNO benchmark - One of the best models for Summarization and Q&A, which means excellent performance for RAG - Tests on propaganda and disinformation questions show the effectiveness of the filtering approach at the pretraining stage and during instruction fine-tuning Model measurements and comparisons will be published as part of the Ukrainian LLM Leaderboard project; subscribe to the Telegram channel for further news. Lapa LLM demonstrates the best performance in pretraining benchmarks for Ukrainian language processing, which opens opportunities for use by other researchers to adapt for their own tasks. The model was trained on data evaluated by various quality assessment models - evaluation of propaganda and disinformation presence, readability, grammar assessment, etc. In the final stages of training, the model was trained on high-quality materials provided for commercial use by the Open Data division of Harvard Library. Unlike most available models, Lapa LLM is a maximally open project: - The model is available for commercial use - Approximately 25 datasets for model training have been published - Methods for filtering and processing data are disclosed, including for detecting disinformation and propaganda - Open source code for the model - Documentation of the training process is available This openness allows for the development of the Ukrainian NLP community and helps businesses obtain a tool for the most efficient Ukrainian language processing in terms of both computation and results. - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output context of 8192 tokens Below, there are some code snippets on how to get quickly started with running the model.First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. Then, copy the snippet from the section that is relevant for your use case. Data used for model training and how the data was processed. These models have certain limitations that users should be aware of. Open vision-language models (VLMs) models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. - Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use]. - Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
lapa-v0.1.2-instruct-GGUF
This repository contains a quantized versions of https://huggingface.co/lapa-llm/lapa-v0.1.2-instruct
Lapa V0.1.2 Instruct
Introducing Lapa LLM v0.1.2 — the most efficient Ukrainian open-source language model > Demo page: https://huggingface.co/spaces/lapa-llm/lapa > Link to Lapa Models: https://huggingface.co/collections/lapa-llm/lapa-v012-release > Quantized versions: https://huggingface.co/lapa-llm/lapa-v0.1.2-instruct-GGUF Today, we proudly present Lapa LLM — a cutting-edge open large language model based on Gemma-3-12B with a focus on Ukrainian language processing. The project is the result of many months of work by a team of Ukrainian researchers in artificial intelligence from the Ukrainian Catholic University, AGH University of Krakow, Igor Sikorsky Kyiv Polytechnic Institute, and Lviv Polytechnic, who united to create the best model for Ukrainian language processing. The model is named in honor of Valentyn Lapa, who together with Oleksiy Ivakhnenko created the Group Method of Data Handling, which is a predecessor to Deep Learning (source). The project's goal is to create the best model for Ukrainian language processing with open datasets for pretraining and instruction tuning. Thanks to a SOTA method for tokenizer adaptation developed by Mykola Haltiuk as part of this project, it was possible to replace 80,000 tokens out of 250,000 with Ukrainian ones without loss of model quality, thus making Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, for working with Ukrainian, the model requires 1.5 times fewer tokens, thus performing three times fewer computations to achieve better results. Most efficient instruction-tuned model on the market Our instruction version of the model in some benchmark categories is only slightly behind the current leader — MamayLM. The team is actively working on new datasets to further improve benchmark scores, which we plan to surpass in the v1.0 model. - Best English-to-Ukrainian translator with a result of 33 BLEU on FLORES and vice versa, which allows for natural and cost-effective translation of new NLP datasets into Ukrainian - One of the best models for image processing in Ukrainian in its size class, as measured on the MMZNO benchmark - One of the best models for Summarization and Q&A, which means excellent performance for RAG - Tests on propaganda and disinformation questions show the effectiveness of the filtering approach at the pretraining stage and during instruction fine-tuning Model measurements and comparisons will be published as part of the Ukrainian LLM Leaderboard project; subscribe to the Telegram channel for further news. Lapa LLM demonstrates the best performance in pretraining benchmarks for Ukrainian language processing, which opens opportunities for use by other researchers to adapt for their own tasks. The model was trained on data evaluated by various quality assessment models - evaluation of propaganda and disinformation presence, readability, grammar assessment, etc. In the final stages of training, the model was trained on high-quality materials provided for commercial use by the Open Data division of Harvard Library. Unlike most available models, Lapa LLM is a maximally open project: - The model is available for commercial use - Approximately 25 datasets for model training have been published - Methods for filtering and processing data are disclosed, including for detecting disinformation and propaganda - Open source code for the model - Documentation of the training process is available This openness allows for the development of the Ukrainian NLP community and helps businesses obtain a tool for the most efficient Ukrainian language processing in terms of both computation and results. Lapa LLM opens wide possibilities for: - Processing sensitive documents without transferring data to external servers - Working with Ukrainian texts taking into account cultural and historical context without code-switching to Russian or other languages - Building RAG systems and chatbots that write in proper Ukrainian - Developing specialized solutions through the ability to fine-tune for specific tasks - Machine translation with the best translation quality from English to Ukrainian and vice versa among all models, including API providers - Complete development of the reasoning model - We are collecting community feedback on the model's performance, so we look forward to receiving it on GitHub or HuggingFace! - Collecting additional datasets for image processing in Ukrainian - Collecting additional datasets for instruction following and programming The creation of Lapa LLM was made possible thanks to the support of our partners and sponsors, primarily the startup Comand.AI, which provided computational resources for training the model. We also want to thank the company ELEKS, which supported this project through a grant dedicated to the memory of Oleksiy Skrypnyk, and the startup HuggingFace, which provided a free corporate subscription to the team for storing models and datasets. Try the model: https://huggingface.co/spaces/lapa-llm/lapa Code: https://github.com/lapa-llm/lapa-llm Subscribe to the Telegram channel for further news about the project: https://t.me/pehadeblog - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each - Total input context of 128K tokens - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output context of 8192 tokens Below, there are some code snippets on how to get quickly started with running the model. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. Then, copy the snippet from the section that is relevant for your use case. You can initialize the model and processor for inference with `pipeline` as follows. With instruction-tuned models, you need to use chat templates to process our inputs first. Then, you can pass it to the pipeline.
gec-score-model
manipulative-score-model
fasttext-quality-score
fineweb-nemotron-edu-score
fineweb-mixtral-edu-score
alignment-score-model
tokenizer
Using the same approach as Tereshchenko Blue, now trained on the full Kobza corpus. By adding more than 80K Ukrainian tokens without removing any English or EU languages tokens, Lapa Tokenizer makes Ukrainian the core language in the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens. How to possible More than 16 of the most popular writing systems in the world were analyzed. Roughly four-fifths of tokens in scripts geographically and culturally distant from Ukraine—for example Bengali, Thai, Chinese, Japanese, and Korean—were pruned. Replaced tokens |Writing system|Tokens removed|Tokens retained| |-|-|-| |Han (Chinese)|16,488|4,122| |Devanagari (Hindi)|10,976|2,743| |Bengali|7,983|1,995| |Arabic|6,730|1,682| |Hiragana / Katakana (Japanese)|3,944|985| |Hangul (Korean)|3,744|935| |Tamil|3,080|770| |Thai|1,740|435| |Malayalam|1,566|391| |Telugu|1,428|356| |Gujarati|1,080|270| |Kannada|1,016|253| |Ethiopic|691|172| |Hebrew|670|167| |Khmer|481|119| |Sinhala|435|108| |Myanmar|410|102| |Lao|243|60| |Gurmukhi|215|53| |Tibetan|107|26| |Oriya|100|25| |Cyrillic|13,398|0| |Gemma-3 \ |6,139|102| 1. +81,492 new Cyrillic BPE tokens trained on the full Kobza corpus plus the Cyrillic slice of the Crimean Tatar corpus. 2. Just tokens from `Replaced tokens` table was replaced, no any tokens from other Writing system was affected. 3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings. 4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one. 5. Reasoning tokens "fixed" - means that we remove condition that allow to add empty ` ` for hybrid approach. This significantly speeds up tokenization.