MaLA-LM
emma-500-llama3.1-8b-bi
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

EMMA-500 Llama 3.1 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training on the Llama 3.1 8B architecture. Leveraging the MaLA Corpus, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels at multilingual tasks such as commonsense reasoning, machine translation, and text classification.

- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469
- Architecture: Built on Llama 3.1 8B with enhanced language adaptation through continual pre-training.
- Languages: Supports 546 languages, each with substantial training data (over 100k tokens).
- Data Mix: A diverse mix combining monolingual text from domains such as code, books, instruction data, and papers with bilingual translation data in 2,500+ language pairs.
- Total Tokens: 671B

EMMA-500 series

- 🤗MaLA-LM/emma-500-llama2-7b: continual pre-training (CPT) model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs
- 🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3.1-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs

MaLA corpora

- MaLA monolingual corpus: 🤗MaLA-LM/mala-monolingual-split
- MaLA bilingual translation corpus: 🤗MaLA-LM/mala-bilingual-translation-corpus
- MaLA code and reasoning corpus: 🤗MaLA-LM/mala-code-reasoning-v2

Usage

You can use EMMA-500 for multilingual text generation; a minimal example is given at the end of this card.

Use Cases

- Massively multilingual NLP tasks, e.g., machine translation

Limitations

- Performance regression on some tasks and for some high-resource languages
- Not suitable for real-world deployment, especially in high-stakes domains

Citation

If you find this model useful, please cite the paper above (https://arxiv.org/abs/2506.00469). For the first-generation EMMA-500 model trained on Llama 2, see 🤗MaLA-LM/emma-500-llama2-7b and its accompanying paper.
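A minimal generation sketch for this model, assuming the standard Hugging Face transformers causal-LM API; the model ID comes from the card above, while the prompt and sampling settings are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3.1-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)

# Plain continuation: this is a base (non-instruct) model, so no chat template.
prompt = "The most widely spoken languages in Africa are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```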
emma-500-llama2-7b
emma-500-llama3-8b-mono
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

EMMA-500 Llama 3 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training on the Llama 3 8B architecture. Leveraging the MaLA Corpus, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels at multilingual tasks such as commonsense reasoning, machine translation, and text classification.

- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469
- Architecture: Built on Llama 3 8B with enhanced language adaptation through continual pre-training.
- Languages: Supports 546 languages, each with substantial training data (over 100k tokens).
- Data Mix: A diverse monolingual mix of text from domains such as code, books, instruction data, and papers.
- Total Tokens: 419B

EMMA-500 series

- 🤗MaLA-LM/emma-500-llama2-7b: continual pre-training (CPT) model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs
- 🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3.1-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs

MaLA corpora

- MaLA monolingual corpus: 🤗MaLA-LM/mala-monolingual-split
- MaLA code and reasoning corpus: 🤗MaLA-LM/mala-code-reasoning-v2

Usage

You can use EMMA-500 for multilingual text generation; a minimal example is given at the end of this card.

Use Cases

- Massively multilingual NLP tasks, e.g., machine translation

Limitations

- Performance regression on some tasks and for some high-resource languages
- Not suitable for real-world deployment, especially in high-stakes domains

Citation

If you find this model useful, please cite the paper above (https://arxiv.org/abs/2506.00469). For the first-generation EMMA-500 model trained on Llama 2, see 🤗MaLA-LM/emma-500-llama2-7b and its accompanying paper.
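The same kind of usage can also go through the transformers text-generation pipeline, which bundles tokenizer and model loading into one call; a minimal sketch, with the model ID from this card and an illustrative prompt:

```python
import torch
from transformers import pipeline

# The pipeline forwards torch_dtype/device_map to the underlying model loader.
generator = pipeline(
    "text-generation",
    model="MaLA-LM/emma-500-llama3-8b-mono",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator("The three most spoken languages in India are", max_new_tokens=60)
print(result[0]["generated_text"])
```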
emma-500-llama3.1-8b-mono
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

EMMA-500 Llama 3.1 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training on the Llama 3.1 8B architecture. Leveraging the MaLA Corpus, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels at multilingual tasks such as commonsense reasoning, machine translation, and text classification.

- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469
- Architecture: Built on Llama 3.1 8B with enhanced language adaptation through continual pre-training.
- Languages: Supports 546 languages, each with substantial training data (over 100k tokens).
- Data Mix: A diverse monolingual mix of text from domains such as code, books, instruction data, and papers.
- Total Tokens: 419B

EMMA-500 series

- 🤗MaLA-LM/emma-500-llama2-7b: continual pre-training (CPT) model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs
- 🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3.1-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs

MaLA corpora

- MaLA monolingual corpus: 🤗MaLA-LM/mala-monolingual-split
- MaLA code and reasoning corpus: 🤗MaLA-LM/mala-code-reasoning-v2

Usage

You can use EMMA-500 for multilingual text generation; a minimal example is given at the end of this card.

Use Cases

- Massively multilingual NLP tasks, e.g., machine translation

Limitations

- Performance regression on some tasks and for some high-resource languages
- Not suitable for real-world deployment, especially in high-stakes domains

Citation

If you find this model useful, please cite the paper above (https://arxiv.org/abs/2506.00469). For the first-generation EMMA-500 model trained on Llama 2, see 🤗MaLA-LM/emma-500-llama2-7b and its accompanying paper.
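For batched multilingual generation, a sketch assuming the standard transformers API; the pad-token handling reflects the usual convention for Llama-family tokenizers (they ship without a pad token), and the prompts are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3.1-8b-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Reuse EOS as the pad token and left-pad, so each generated continuation
# starts immediately after its prompt.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "The capital of Kenya is",
    "Der höchste Berg Europas ist",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```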
emma-500-llama3-8b-bi
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

EMMA-500 Llama 3 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training on the Llama 3 8B architecture. Leveraging the MaLA Corpus, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels at multilingual tasks such as commonsense reasoning, machine translation, and text classification.

- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469
- Architecture: Built on Llama 3 8B with enhanced language adaptation through continual pre-training.
- Languages: Supports 546 languages, each with substantial training data (over 100k tokens).
- Data Mix: A diverse mix combining monolingual text from domains such as code, books, instruction data, and papers with bilingual translation data in 2,500+ language pairs.
- Total Tokens: 671B

EMMA-500 series

- 🤗MaLA-LM/emma-500-llama2-7b: continual pre-training (CPT) model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs
- 🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on a monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3.1-8b-bi: CPT model trained on a monolingual data mix in 500+ languages plus bilingual translation data in 2,500+ language pairs

MaLA corpora

- MaLA monolingual corpus: 🤗MaLA-LM/mala-monolingual-split
- MaLA bilingual translation corpus: 🤗MaLA-LM/mala-bilingual-translation-corpus
- MaLA code and reasoning corpus: 🤗MaLA-LM/mala-code-reasoning-v2

Usage

You can use EMMA-500 for multilingual text generation; a minimal example is given at the end of this card.

Use Cases

- Massively multilingual NLP tasks, e.g., machine translation

Limitations

- Performance regression on some tasks and for some high-resource languages
- Not suitable for real-world deployment, especially in high-stakes domains

Citation

If you find this model useful, please cite the paper above (https://arxiv.org/abs/2506.00469). For the first-generation EMMA-500 model trained on Llama 2, see 🤗MaLA-LM/emma-500-llama2-7b and its accompanying paper.
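Since this variant is continually pre-trained on bilingual translation data, a translation-style prompt is a natural use case. A minimal sketch via the standard transformers API; the "English: ... / Finnish: ..." one-shot prompt format is an illustrative assumption, not a format specified by the card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One-shot translation prompt: one demonstration pair, then the source
# sentence to translate; the model continues after "Finnish:".
prompt = (
    "English: Good morning.\n"
    "Finnish: Hyvää huomenta.\n"
    "English: Thank you very much.\n"
    "Finnish:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```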