orai-nlp

11 models

ElhBERTeu

license:cc-by-4.0
2,969
3

Gemma-Kimu-9b-it

Gemma-Kimu-9B-Instruct v1.0 is an instruction-tuned large language model (LLM) tailored specifically for the Basque language, built from Google's Gemma-2-9b foundational model and its Gemma-2-9b-it instruct counterpart. The approach decouples language adaptation from post-training alignment: the foundational LLM is first continually pre-trained on a modest amount of monolingual target-language data while anchoring on English replay, and instruction-following capabilities are then injected via delta-based weight merging from the instructed counterpart of the base LLM. We first continually pre-train the base LLM on monolingual Basque data to improve its linguistic capacity. Then, instead of post-training from scratch, we merge the post-training delta into the language-adapted model via weight merging. This simple yet effective method transfers not only instruction-following capabilities but also human preference alignment. Evaluations show that Gemma-Kimu-9b-it exhibits notable improvements over Gemma-2-9b-it in Basque instruction following, safety, and linguistic correctness. Want to test this model in a real setting? Join the playground waitlist.

For continual pre-training, we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.

- ZelaiHandi (San Vicente et al., 2024): the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources, comprising approximately 521 million words, or roughly 1.5 billion tokens (Llama 3.1 tokenizer).
- FineWeb (Penedo et al., 2024): more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl, from which we selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).

To evaluate the instruction-following capabilities of our models in Basque, we use the NoRobotsEU benchmark (Corral et al., 2025), a manually translated subset of the original NoRobots test set. It consists of 100 Basque instructions, each paired with its English counterpart, spanning 9 diverse categories.

| Model | Instruct follow. EU | Instruct follow. EN |
|------------------|---------------------|---------------------|
| Gemma-2-2b-it | 7 | 71 |
| Gemma-Kimu-2b-it | 48 | 60 |
| Gemma-2-9b-it | 57 | 86 |
| Gemma-Kimu-9b-it | 71 | 82 |

Additional evaluation results covering linguistic proficiency and safety are reported in Sarasua et al. (2025).

To use the model, copy the usage snippet, replace the content of the user message with your prompt, and run it (a minimal sketch is provided at the end of this card).

This model is derived from Gemma 2 and is licensed under the Gemma License. Copyright © Google DeepMind. All Rights Reserved.

This work is part of the BasqueLLM project, titled "bi-SLM: Optimization of Industrial Processes through Bilingual SLMs" (EXP: 2025-CIE4-000048-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted on the Hyperion system at the Donostia International Physics Center (DIPC).

If you use this model, please cite the corresponding reference. Contact:
- Ixak Sarasua ([email protected])
- Ander Corral ([email protected])
- Xabier Saralegi ([email protected])
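The usage snippet from the original card is not preserved in this listing. Below is a minimal sketch using the standard 🤗 Transformers chat pipeline; the repository id, dtype, and generation settings are assumptions and should be adjusted to your setup.

```python
# Minimal usage sketch (assumed repo id and settings; not the original card's snippet).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="orai-nlp/Gemma-Kimu-9b-it",  # assumed repository id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Replace the user message content with your own prompt.
messages = [{"role": "user", "content": "Kaixo! Azaldu labur zer den ZelaiHandi corpusa."}]
outputs = pipe(messages, max_new_tokens=256)

# With chat-style input, the pipeline returns the conversation including the new assistant turn.
print(outputs[0]["generated_text"][-1]["content"])
```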

794
4

Gemma-Kimu-2b-it

Gemma-Kimu-2b-Instruct v1.0 is an instruction-tuned large language model (LLM) tailored specifically for the Basque language, built from Google's Gemma-2-2b foundational model and its Gemma-2-2b-it instruct counterpart. The approach decouples language adaptation from post-training alignment: the foundational LLM is first continually pre-trained on a modest amount of monolingual target-language data while anchoring on English replay, and instruction-following capabilities are then injected via delta-based weight merging from the instructed counterpart of the base LLM. We first continually pre-train the base LLM on monolingual Basque data to improve its linguistic capacity. Then, instead of post-training from scratch, we merge the post-training delta into the language-adapted model via weight merging. This simple yet effective method transfers not only instruction-following capabilities but also human preference alignment. Evaluations show that Gemma-Kimu-2b-it exhibits notable improvements over Gemma-2-2b-it in Basque instruction following, safety, and linguistic correctness. Want to test this model in a real setting? Join the playground waitlist.

For continual pre-training, we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.

- ZelaiHandi (San Vicente et al., 2024): the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources, comprising approximately 521 million words, or roughly 1.5 billion tokens (Llama 3.1 tokenizer).
- FineWeb (Penedo et al., 2024): more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl, from which we selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).

To evaluate the instruction-following capabilities of our models in Basque, we use the NoRobotsEU benchmark (Corral et al., 2025), a manually translated subset of the original NoRobots test set. It consists of 100 Basque instructions, each paired with its English counterpart, spanning 9 diverse categories.

| Model | Instruct follow. EU | Instruct follow. EN |
|------------------|---------------------|---------------------|
| Gemma-2-2b-it | 7 | 71 |
| Gemma-Kimu-2b-it | 48 | 60 |
| Gemma-2-9b-it | 57 | 86 |
| Gemma-Kimu-9b-it | 71 | 82 |

Additional evaluation results covering linguistic proficiency and safety are reported in Sarasua et al. (2025).

To use the model, adapt the usage sketch shown for Gemma-Kimu-9b-it above, swapping the model identifier for the 2B repository; replace the content of the user message with your prompt and run it. A sketch of the delta-based merging step is provided at the end of this card.

This model is derived from Gemma 2 and is licensed under the Gemma License. Copyright © Google DeepMind. All Rights Reserved.

This work is part of the BasqueLLM project, titled "bi-SLM: Optimization of Industrial Processes through Bilingual SLMs" (EXP: 2025-CIE4-000048-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted on the Hyperion system at the Donostia International Physics Center (DIPC).

If you use this model, please cite the corresponding reference. Contact:
- Ixak Sarasua ([email protected])
- Ander Corral ([email protected])
- Xabier Saralegi ([email protected])
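To make the delta-based weight merging described above concrete, here is a minimal sketch. It assumes the Gemma-2-2b base and instruct checkpoints plus a language-adapted base such as Gemma-Kimu-2b-base (repository ids are assumptions), and it is an illustrative reconstruction, not the authors' released tooling. Gemma 2 checkpoints are gated, so Hugging Face authentication and license acceptance are required.

```python
# Illustrative delta-based weight merging: merged = language_adapted + (instruct - base).
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)
adapted = AutoModelForCausalLM.from_pretrained("orai-nlp/Gemma-Kimu-2b-base", torch_dtype=torch.bfloat16)  # assumed repo id

base_sd, instruct_sd, adapted_sd = base.state_dict(), instruct.state_dict(), adapted.state_dict()

merged_sd = {}
for name, adapted_param in adapted_sd.items():
    if name in instruct_sd and name in base_sd and adapted_param.shape == instruct_sd[name].shape:
        # Add the instruction-tuning delta on top of the language-adapted weights.
        merged_sd[name] = adapted_param + (instruct_sd[name] - base_sd[name])
    else:
        # Fall back to the adapted weights where parameters do not line up (e.g., resized embeddings).
        merged_sd[name] = adapted_param

adapted.load_state_dict(merged_sd)
adapted.save_pretrained("gemma-kimu-2b-it-merged")  # save the tokenizer from the instruct repo separately
```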

384
3

Llama-eus-8B

Llama-eus-8B, a foundational sub-10 billion parameter LLM for Basque. Llama-eus-8B v1.0 is a foundational large language model (LLM) adapted from Meta's Llama3.1-8B and tailored specifically for the Basque language. Through continual pre-training on a combination of the ZelaiHandi dataset (approximately 1.5 billion high-quality Basque tokens) and a selected subset of the FineWeb dataset (around 300 million tokens), Llama-eus-8B aims to enhance linguistic performance in Basque while maintaining general English capabilities. The original Meta Llama 3.1 collection of models was trained on 15 trillion tokens, with some multilingual content supporting seven additional languages besides English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. However, it has limitations for low-resource languages such as Basque, leading to poor performance in formal linguistic competence (correct use of grammar and vocabulary) and functional linguistic competence (the ability to understand and use language in real-world contexts). To address this, Meta-Llama-3.1-8B was used as the base for Llama-eus-8B, which underwent specialized pre-training to improve these competences in Basque. Evaluations show that Llama-eus-8B exhibits notable improvements over Meta-Llama-3.1-8B in Basque for tasks requiring linguistic competence, with minimal degradation in performance for English.

📕 Paper: Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

- Developed by: Orai NLP Technologies
- Model type: Foundational LLM
- License: Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
- Finetuned from model: Built with Llama (Llama3.1-8B)

For continual pre-training (CPT), we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.

- ZelaiHandi (San Vicente et al., 2024): the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources, comprising approximately 521 million words, or roughly 1.5 billion tokens (Llama 3.1 tokenizer).
- FineWeb (Penedo et al., 2024): more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl, from which we selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).

Llama-eus-8B was trained within the 🤗 Transformers ecosystem, using 🤗 Accelerate and DeepSpeed ZeRO for efficient large-scale model training. The process was conducted on the Hyperion system at the Donostia International Physics Center (DIPC), leveraging 8x NVIDIA A100 80GB SXM4 GPUs. The model was trained with a sequence length of 4096 tokens and an effective batch size of approximately 2 million tokens, over 4 epochs, resulting in a total of around 7.2 billion tokens processed. A cosine learning rate schedule was used, with a peak learning rate of 1e-4 and a warm-up phase comprising 10% of the total steps. All remaining hyperparameters followed the configurations established by Touvron et al. (2023). A sketch of this configuration is shown below.
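The following is an illustrative configuration sketch of the continual pre-training setup just described (sequence length 4096, roughly 2M-token effective batch, 4 epochs, cosine schedule with peak LR 1e-4 and 10% warm-up). The per-device batch split, DeepSpeed config filename, and output paths are assumptions, not the authors' released training script.

```python
# Illustrative CPT hyperparameters mirroring the description above.
# Assumed split of the ~2M-token effective batch across 8 GPUs:
# 8 GPUs x 16 sequences x 4 grad-accum steps x 4096 tokens ≈ 2.1M tokens per update.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-eus-8b-cpt",        # assumed output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    deepspeed="ds_zero3.json",            # DeepSpeed ZeRO config (assumed filename)
    logging_steps=10,
    save_strategy="epoch",
)
```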
To evaluate our model, we created Basque versions of well-established English benchmarks by manually translating a selected subset of these datasets. This approach enabled us to rigorously assess Llama-eus-8B's performance in Basque and to compare it directly with its performance in English, providing a comprehensive evaluation of the model's multilingual capabilities.

- ARCHTeusample (Corral et al., 2024) [25-shot]: A subset of 250 samples manually translated to Basque from the ARC dataset (Clark et al., 2018); the corresponding 250 English samples are also provided. ARC consists of genuine grade-school-level, multiple-choice science questions, assembled to encourage research in advanced question answering.
- WinograndeHTeusample (Corral et al., 2024) [5-shot]: A subset of 250 samples manually translated to Basque from the WinoGrande dataset (Sakaguchi et al., 2019); the corresponding 250 English samples are also provided. WinoGrande is a collection of 44k problems inspired by the Winograd Schema Challenge, adjusted to improve scale and robustness against dataset-specific bias. Formulated as a fill-in-the-blank task with binary options, the goal is to choose the right option for a given sentence, which requires commonsense reasoning.
- MMLUHTeusample (Corral et al., 2024) [5-shot]: A subset of 270 samples manually translated to Basque from the MMLU dataset (Hendrycks et al., 2020); the corresponding 250 English samples are also provided. MMLU is a massive multitask test of multiple-choice questions from various branches of knowledge, spanning subjects in the humanities, social sciences, hard sciences, and other areas.
- HellaSwagHTeusample (Corral et al., 2024) [10-shot]: A subset of 250 samples manually translated to Basque from the HellaSwag dataset (Zellers et al., 2019); the corresponding 250 English samples are also provided. HellaSwag is a benchmark for commonsense natural language inference.

Additionally, we evaluated our model on a suite of publicly available Basque benchmarks:

- BL2MP (Urbizu et al., 2024) [0-shot]: A test set designed to assess the grammatical knowledge of language models in Basque, inspired by the BLiMP benchmark.
- Belebele (Bandarkar et al.) [5-shot]: A multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants.
- X-StoryCloze (Lin et al.) [0-shot]: Professional translations of the English StoryCloze dataset into 10 non-English languages; a commonsense reasoning framework for evaluating story understanding, story generation, and script learning.
- BasqueGLUE (Urbizu et al.) [5-shot]: An NLU benchmark for Basque, built from previously existing datasets following criteria similar to those used for GLUE and SuperGLUE.
- EusProficiency (Etxaniz et al., 2024) [5-shot]: 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
- EusReading (Etxaniz et al., 2024) [1-shot]: 352 reading comprehension exercises sourced from past EGA (C1 Basque certificate) exams from 1998 to 2008.
- EusTrivia (Etxaniz et al., 2024) [5-shot]: 1,715 trivia questions from multiple online sources; a significant portion focuses specifically on the Basque Country, its language, and its culture.
- EusExams (Etxaniz et al., 2024) [5-shot]: A collection of tests designed to prepare individuals for Public Service examinations conducted by several Basque institutions, including the public health system Osakidetza, the Basque Government, the City Councils of Bilbao and Gasteiz, and the University of the Basque Country (UPV/EHU).

For the evaluation, we compare our model against the Latxa models (Etxaniz et al., 2024) to assess its performance and effectiveness in Basque language tasks. Latxa is a family of large language models specifically developed for Basque, with parameter sizes ranging from 7 billion to 70 billion. As the only existing models adapted to Basque, Latxa provides a valuable baseline for our comparison. Additionally, we compare our model against Meta's Llama 3.1 models (Dubey et al., 2024), including the 8B and 70B versions. The Meta-Llama-3.1-8B model serves as the base model for our continual pre-training, providing a baseline for evaluating the improvements achieved through our approach.

Model evaluations were conducted with the LM Evaluation Harness library from EleutherAI (a usage sketch is shown below).
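The sketch below illustrates running some of the publicly available Basque tasks with the LM Evaluation Harness Python API. It assumes the repository id orai-nlp/Llama-eus-8B and a recent harness release that ships the eus_* Basque tasks; the manually translated *eusample subsets are distributed separately and are omitted here, and the exact harness version and arguments used by the authors are not stated in this listing.

```python
# Illustrative evaluation with EleutherAI's lm-evaluation-harness (Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=orai-nlp/Llama-eus-8B,dtype=bfloat16",  # assumed repo id
    tasks=["eus_exams", "eus_proficiency", "eus_trivia"],          # 5-shot tasks from the card
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])
```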
We divide the evaluation into sub-10 billion parameter models and over-10 billion parameter models to better understand performance differences across model sizes. This distinction allows for a fairer comparison of our model against both smaller- and larger-scale models.

Table 1 and Table 2 present the performance of sub-10 billion parameter models on Basque and English benchmarks, respectively. We compare our Llama-eus-8B model with the Basque model latxa-7b-v1.2 and also report results for the base model Meta-Llama-3.1-8B.

| Models | BL2MP | ARC | Winogrande | MMLU | HellaSwag | Belebele | X-StoryCloze | EusExams | EusProficiency | EusReading | EusTrivia | BasqueGLUE | Average |
|-------------------|-------|-------|------------|-------|-----------|----------|--------------|----------|----------------|------------|-----------|------------|---------|
| latxa-7b-v1.2 | 89.33 | 54.80 | 65.60 | 34.44 | 61.20 | 37.33 | 65.45 | 33.82 | 30.26 | 26.99 | 42.16 | 52.56 | 49.50 |
| Llama-eus-8B | 89.22 | 55.20 | 67.20 | 53.33 | 63.60 | 73.44 | 65.72 | 52.51 | 48.44 | 54.55 | 56.21 | 55.27 | 61.22 |
| Meta-Llama-3.1-8B | 60.50 | 42.80 | 56.80 | 48.52 | 46.80 | 61.78 | 55.66 | 45.65 | 32.50 | 43.18 | 44.49 | 46.33 | 48.75 |

Table 1: Performance on Basque test sets for sub-10 billion parameter models. The best-performing model is highlighted in bold.

Llama-eus-8B consistently outperforms the other two models across all test sets, with only a minor drop on BL2MP, achieving the highest average score of 61.22. This highlights the effectiveness of our continual pre-training strategy, which significantly enhances Basque performance compared to the base model Meta-Llama-3.1-8B.

| Models | ARC | Winogrande | MMLU | HellaSwag | Belebele | X-StoryCloze | Average |
|-------------------|-------|------------|-------|-----------|----------|--------------|---------|
| latxa-7b-v1.2 | 61.20 | 75.60 | 38.15 | 76.40 | 41.56 | 73.66 | 61.10 |
| Llama-eus-8B | 67.60 | 78.40 | 62.59 | 86.40 | 84.67 | 78.49 | 76.36 |
| Meta-Llama-3.1-8B | 69.20 | 82.00 | 66.67 | 86.40 | 87.44 | 78.23 | 78.32 |

Table 2: Performance on English test sets for sub-10 billion parameter models. The best-performing model is highlighted in bold.

On English benchmarks, the Meta-Llama-3.1-8B model leads in most categories, showing strong overall performance. However, Llama-eus-8B performs notably well, with only a 2-point decrease on average, highlighting the effectiveness of continual pre-training with both Basque and English data to avoid catastrophic forgetting.

Table 3 and Table 4 present the performance of our Llama-eus-8B model against over-10 billion parameter models on Basque and English benchmarks, respectively. We compare Llama-eus-8B with the 13B and 70B versions of Latxa and with the 70B version of Meta's Llama 3.1.

| Models | BL2MP | ARC | Winogrande | MMLU | HellaSwag | Belebele | X-StoryCloze | EusExams | EusProficiency | EusReading | EusTrivia | BasqueGLUE | Average |
|--------------------|-------|-------|------------|-------|-----------|----------|--------------|----------|----------------|------------|-----------|------------|---------|
| latxa-13b-v1.2 | 88.67 | 55.60 | 69.60 | 39.63 | 61.60 | 53.89 | 66.51 | 43.66 | 44.11 | 34.94 | 56.38 | 53.36 | 55.66 |
| latxa-70b-v1.2 | 88.72 | 64.80 | 72.80 | 47.78 | 67.20 | 71.67 | 70.55 | 51.90 | 60.65 | 52.27 | 62.45 | 59.74 | 64.21 |
| Llama-eus-8B | 89.22 | 55.20 | 67.20 | 53.33 | 63.60 | 73.44 | 65.72 | 52.51 | 48.44 | 54.55 | 56.21 | 55.27 | 61.22 |
| Meta-Llama-3.1-70B | 67.89 | 67.20 | 70.00 | 63.70 | 63.60 | 87.67 | 65.98 | 64.62 | 44.86 | 72.44 | 60.23 | 63.50 | 65.97 |

Table 3: Performance on Basque test sets for over-10 billion parameter models. The best-performing model is highlighted in bold. In the original table rendering, light green indicates that Llama-eus-8B surpasses the 13B model, while dark green indicates that Llama-eus-8B outperforms both Basque-adapted systems (13B and 70B).
Table 3 shows that Llama-eus-8B outperforms the Latxa-13B model and performs competitively with the Latxa-70B model across various Basque benchmarks. While the Latxa-70B model excels in several categories, particularly in Basque-specific tasks, Llama-eus-8B still achieves a high average score of 61.22 with far fewer parameters, providing strong performance without requiring the largest model size.

| Models | ARC | Winogrande | MMLU | HellaSwag | Belebele | X-StoryCloze | Average |
|--------------------|-------|------------|-------|-----------|----------|--------------|---------|
| latxa-13b-v1.2 | 66.80 | 80.80 | 47.41 | 83.20 | 63.44 | 76.51 | 69.69 |
| latxa-70b-v1.2 | 70.00 | 84.80 | 51.48 | 86.00 | 81.78 | 78.76 | 75.47 |
| Llama-eus-8B | 67.60 | 78.40 | 62.59 | 86.40 | 84.67 | 78.49 | 76.36 |
| Meta-Llama-3.1-70B | 78.40 | 85.60 | 72.22 | 92.00 | 94.44 | 81.01 | 83.95 |

Table 4: Performance on English test sets for over-10 billion parameter models. The best-performing model is highlighted in bold. In the original table rendering, light green indicates that Llama-eus-8B surpasses the 13B model, while dark green indicates that Llama-eus-8B outperforms both Basque-adapted systems (13B and 70B).

Table 4 shows that the Meta-Llama-3.1-70B model leads on the English benchmarks, achieving the highest average score of 83.95; its larger parameter count contributes to its superior performance across most English tasks. Llama-eus-8B competes closely with the larger Latxa models despite having fewer parameters.

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 8x NVIDIA A100 80GB SXM4
- Hours used: 561.4 GPU hours
- Hardware Provider: Donostia International Physics Center (DIPC)
- Compute Region: Spain
- Carbon Emitted: 97.01 kg CO2 eq

Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

This work is part of the BasqueLLM project, titled "First steps towards an artificial intelligence in Basque based on LLMs" (EXP: 2023-CIEN-000081-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted on the Hyperion system at the Donostia International Physics Center (DIPC).

If you use Llama-eus-8B, please cite the corresponding reference. Contact:
- Ander Corral ([email protected])
- Xabier Saralegi ([email protected])

llama
101
10

ElhBERTeu-medium

license:cc-by-4.0
21
0

Llama-eus-8B-Magpie_mix

llama
18
0

bert-medium-sw

license:cc-by-4.0
5
0

bert-base-sw

license:cc-by-4.0
3
0

Gemma-Kimu-2b-base

Gemma-Kimu-2b v1.0 is a continually pre-trained large language model (LLM) for the Basque language, built upon Google's Gemma-2-2b foundational model. This model focuses solely on language adaptation, without instruction-following alignment, and serves as the base model for subsequent instruct-tuned versions such as Gemma-Kimu-2b-it. We continually pre-train the Gemma-2-2b model on a combination of Basque monolingual data and English replay to improve Basque linguistic capabilities while preserving English performance. This phase enhances the model's syntactic, lexical, and morphological competence in Basque and establishes a solid foundation for downstream instruction-tuned and task-specific models. Evaluations show that Gemma-Kimu-2b exhibits significant improvements over the original Gemma-2-2b in Basque language understanding, coherence, and text generation fluency.

For continual pre-training, we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.

- ZelaiHandi (San Vicente et al., 2024): the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources, comprising approximately 521 million words, or roughly 1.5 billion tokens (Llama 3.1 tokenizer).
- FineWeb (Penedo et al., 2024): more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl, from which we selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).

This model is derived from Gemma 2 and is licensed under the Gemma License. Copyright © Google DeepMind. All Rights Reserved.

This work is part of the BasqueLLM project, titled "bi-SLM: Optimization of Industrial Processes through Bilingual SLMs" (EXP: 2025-CIE4-000048-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted on the Hyperion system at the Donostia International Physics Center (DIPC).

Contact:
- Ixak Sarasua ([email protected])
- Ander Corral ([email protected])
- Xabier Saralegi ([email protected])

3
0

ElhBERTeu-nerc

license:cc-by-nc-4.0
2
0

Gemma-Kimu-9b-base

Gemma-Kimu-9b v1.0 is a continually pre-trained large language model (LLM) for the Basque language, built upon Google's Gemma-2-9b foundational model. This model focuses solely on language adaptation, without instruction-following alignment, and serves as the base model for subsequent instruct-tuned versions such as Gemma-Kimu-9b-it. We continually pre-train the Gemma-2-9b model on a combination of Basque monolingual data and English replay to improve Basque linguistic capabilities while preserving English performance. This phase enhances the model's syntactic, lexical, and morphological competence in Basque and establishes a solid foundation for downstream instruction-tuned and task-specific models. Evaluations show that Gemma-Kimu-9b exhibits significant improvements over the original Gemma-2-9b in Basque language understanding, coherence, and text generation fluency.

For continual pre-training, we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.

- ZelaiHandi (San Vicente et al., 2024): the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources, comprising approximately 521 million words, or roughly 1.5 billion tokens (Llama 3.1 tokenizer).
- FineWeb (Penedo et al., 2024): more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl, from which we selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).

This model is derived from Gemma 2 and is licensed under the Gemma License. Copyright © Google DeepMind. All Rights Reserved.

This work is part of the BasqueLLM project, titled "bi-SLM: Optimization of Industrial Processes through Bilingual SLMs" (EXP: 2025-CIE4-000048-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted on the Hyperion system at the Donostia International Physics Center (DIPC).

Contact:
- Ixak Sarasua ([email protected])
- Ander Corral ([email protected])
- Xabier Saralegi ([email protected])

2
0