# YanoljaNEXT-Rosetta-4B-2510
This model is a fine-tuned version of `google/gemma-3-4b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-4B-2510`
- Base Model: `google/gemma-3-4b-pt`

This model is a 4-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages equally: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

You can use this model with the `transformers` library. The model outputs the final translation in JSON format when appropriate, or plain text for simple translations.

## Training Data

The translation datasets were synthesized from the FineWeb corpora:

- FineWeb Edu
- FineWeb2

The model was fine-tuned on this synthetic multilingual translation data to optimize performance across the supported language pairs.
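As a minimal sketch of the `transformers` usage described above: the system-prompt wording and the `build_messages`/`translate` helpers below are illustrative assumptions, not the card's official recipe.

```python
import json

def build_messages(source: dict, target_language: str) -> list:
    """Build a chat-style request asking for a structure-preserving translation.

    The prompt wording is a hypothetical example; adjust it to taste.
    """
    system = (
        f"Translate the values of the user's JSON into {target_language}. "
        "Keep every key and the overall structure unchanged."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": json.dumps(source, ensure_ascii=False)},
    ]

def translate(source: dict, target_language: str,
              model_id: str = "yanolja/YanoljaNEXT-Rosetta-4B-2510") -> str:
    """Run the model on a JSON payload.

    Requires `transformers` and `torch`, plus network access to download the
    checkpoint, so it is defined here but not executed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(source, target_language),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(output[0][inputs.shape[-1]:],
                            skip_special_tokens=True)
```

With a downloaded checkpoint, something like `translate({"title": "Ocean-view room"}, "Korean")` should return the same JSON shape with translated values.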
The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| google/gemini-2.5-flash-lite | 35.23 |
| yanolja/YanoljaNEXT-Rosetta-4B-2510 | 35.09 |
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| yanolja/YanoljaNEXT-Rosetta-4B | 31.31 |
| openai/gpt-4.1-nano | 31.15 |
| Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 31.02 |
| openai/gpt-oss-20b | 30.56 |
| google/gemma-3-27b-it | 30.05 |
| google/gemma-3-4b-pt | 27.53 |

YanoljaNEXT-Rosetta-4B-2510 achieves competitive translation quality while maintaining the efficiency of a 4B-parameter model. Scores for the other language pairs can be found in the WMT24++ Evaluation Results.

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model is primarily optimized for processing JSON data. Its performance on unstructured text or other data formats may vary. In some cases, the model may produce invalid JSON, repetitive output, or inaccurate translations.

## License

This model is released under the Gemma license, inherited from its base model, `google/gemma-3-4b-pt`. Please consult the official Gemma license terms for detailed usage guidelines.

## Acknowledgments

This work was supported by a Korea Creative Content Agency (KOCCA) grant funded by the Ministry of Culture, Sports and Tourism (MCST) in 2025 (Project Name: Cultivating Masters and Doctoral Experts to Lead Digital-Tech Tourism; Project Number: RS-2024-00442006; Contribution Rate: 100%).
This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.
# YanoljaNEXT-Rosetta-4B-2510-GGUF
This model is a fine-tuned version of `google/gemma-3-4b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-4B-2510`
- Base Model: `google/gemma-3-4b-pt`

This folder contains ready-to-run GGUF files for llama.cpp.

- `BF16/YanoljaNEXT-Rosetta-4B-2510-bf16.gguf`: full-precision reference model
- Quantized variants (choose one based on your device and quality needs):
  - K-family: `Q3K{S,L}`, `Q5K{S,M}`, `Q6K`, `Q80`
  - IQ-family: `IQ2{S,M}`, `IQ3{XXS,XS,S}`, `IQ4{XS}`
- For many types there are matching `IMX` folders. The files there were produced with an activation matrix (`imatrix.gguf`) and usually offer better quality at the same size. In this release, `IQ2{S,M}` and `IQ3{XXS,XS}` are IMX-only.

This model is a 4-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages equally: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

Use a recent build of `llama.cpp` that supports Gemma 3 models. Pick any GGUF file from this folder (a quantized variant is recommended for most users). The model is optimized to output structured JSON for translations when appropriate.
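For a scripted run, one option is the `llama-cpp-python` bindings. The sketch below is an assumption about the wiring, not an official recipe; the model path and prompt wording are placeholders.

```python
def make_system_prompt(target_language: str) -> str:
    """Hypothetical system prompt asking for a structure-preserving translation."""
    return (
        f"Translate the values of the user's JSON into {target_language}. "
        "Keep every key and the overall structure unchanged."
    )

def translate_with_gguf(gguf_path: str, source_json: str,
                        target_language: str) -> str:
    """Run a quantized GGUF via llama-cpp-python (pip install llama-cpp-python).

    Defined but not executed here, since it needs a downloaded model file.
    """
    from llama_cpp import Llama

    llm = Llama(model_path=gguf_path, n_ctx=4096)
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": make_system_prompt(target_language)},
            {"role": "user", "content": source_json},
        ],
        temperature=0.0,
    )
    return out["choices"][0]["message"]["content"]
```

If your build supports GPU offload, passing `n_gpu_layers=-1` to `Llama(...)` is a common way to speed this up.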
Import any of the `.gguf` files into your GUI of choice (LM Studio, KoboldCpp, text-generation-webui) and select chat mode; recent tools automatically use the chat template embedded in the GGUF.

## Training Data

The translation datasets were synthesized from the FineWeb corpora:

- FineWeb Edu
- FineWeb2

The model was fine-tuned on this synthetic multilingual translation data to optimize performance across the supported language pairs.

The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| google/gemini-2.5-flash-lite | 35.23 |
| yanolja/YanoljaNEXT-Rosetta-4B-2510 | 35.09 |
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| yanolja/YanoljaNEXT-Rosetta-4B | 31.31 |
| openai/gpt-4.1-nano | 31.15 |
| Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 31.02 |
| openai/gpt-oss-20b | 30.56 |
| google/gemma-3-27b-it | 30.05 |
| google/gemma-3-4b-pt | 27.53 |

YanoljaNEXT-Rosetta-4B-2510 achieves competitive translation quality while maintaining the efficiency of a 4B-parameter model. Scores for the other language pairs can be found in the WMT24++ Evaluation Results.

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model is primarily optimized for processing JSON data. Its performance on unstructured text or other data formats may vary. In some cases, the model may produce invalid JSON, repetitive output, or inaccurate translations.
## License

This model is released under the Gemma license, inherited from its base model, `google/gemma-3-4b-pt`. Please consult the official Gemma license terms for detailed usage guidelines.

## Acknowledgments

This work was supported by a Korea Creative Content Agency (KOCCA) grant funded by the Ministry of Culture, Sports and Tourism (MCST) in 2025 (Project Name: Cultivating Masters and Doctoral Experts to Lead Digital-Tech Tourism; Project Number: RS-2024-00442006; Contribution Rate: 100%). This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.
# YanoljaNEXT-Rosetta-12B-2510-GGUF
This model is a fine-tuned version of `google/gemma-3-12b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-12B-2510`
- Base Model: `google/gemma-3-12b-pt`

This folder contains ready-to-run GGUF files for llama.cpp.

- `BF16/YanoljaNEXT-Rosetta-12B-2510-bf16.gguf`: full-precision reference model
- Quantized variants (choose one based on your device and quality needs):
  - K-family: `Q2K`, `Q2KS`, `Q3K{S,M}`, `Q4K{S,M}`, `Q5K{S,M}`, `Q6K`, `Q80`
  - IQ-family: `IQ1{S,M}`, `IQ2{XXS,XS,S,M}`, `IQ3{XXS,XS,S,M}`, `IQ4{XS,NL}`
- `IMX` variants were produced with an activation matrix (`imatrix.gguf`) and often offer better quality at the same size. In this release, `Q2K{,S}`, `IQ1`, and all `IQ2` are IMX-only; for `IQ3`, `XXS` and `XS` are IMX-only.

This model is a 12-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages equally: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

Use a recent build of `llama.cpp` that supports Gemma 3 models. Pick any GGUF file from this folder (a quantized variant is recommended for most users). The model is optimized to output structured JSON for translations when appropriate.
Import any of the `.gguf` files into your GUI of choice (LM Studio, KoboldCpp, text-generation-webui) and select chat mode; recent tools automatically use the chat template embedded in the GGUF.

## Training Data

The translation datasets were synthesized from the FineWeb corpora:

- FineWeb Edu
- FineWeb2

The model was fine-tuned on this synthetic multilingual translation data to optimize performance across the supported language pairs.

The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| yanolja/YanoljaNEXT-Rosetta-12B-2510 | 37.36 |
| openai/gpt-4o | 36.08 |
| google/gemini-2.5-flash | 35.25 |
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| google/gemma-3-27b-it | 30.05 |
| google/gemma-3-12b-pt | 29.31 |

YanoljaNEXT-Rosetta-12B-2510 achieves competitive translation quality while maintaining the efficiency of a 12B-parameter model. Scores for the other language pairs can be found in the WMT24++ Evaluation Results.

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model is primarily optimized for processing JSON data. Its performance on unstructured text or other data formats may vary. In some cases, the model may produce invalid JSON, repetitive output, or inaccurate translations.

## License

This model is released under the Gemma license, inherited from its base model, `google/gemma-3-12b-pt`. Please consult the official Gemma license terms for detailed usage guidelines.
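Another common setup is serving a GGUF with llama.cpp's `llama-server` and calling its OpenAI-compatible `/v1/chat/completions` endpoint. The request builder below is an illustrative sketch; the host, port, and prompt wording are assumptions.

```python
import json
import urllib.request

def build_request(source: dict, target_language: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat request for a local llama-server instance."""
    payload = {
        "messages": [
            {
                "role": "system",
                "content": (
                    f"Translate the values of the user's JSON into {target_language}. "
                    "Keep every key and the overall structure unchanged."
                ),
            },
            {"role": "user", "content": json.dumps(source, ensure_ascii=False)},
        ],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With a server running (e.g. `llama-server -m <model>.gguf`), send it like:
#   with urllib.request.urlopen(build_request({"title": "Hi"}, "Korean")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```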
## Acknowledgments

This work was supported by a Korea Creative Content Agency (KOCCA) grant funded by the Ministry of Culture, Sports and Tourism (MCST) in 2025 (Project Name: Cultivating Masters and Doctoral Experts to Lead Digital-Tech Tourism; Project Number: RS-2024-00442006; Contribution Rate: 100%). This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.
# YanoljaNEXT-EEVE-2.8B
If you're passionate about the field of Large Language Models and wish to exchange knowledge and insights, we warmly invite you to join our Discord server. Note that Korean is the primary language used there. The LLM landscape is evolving rapidly, and without active sharing, our collective knowledge risks becoming outdated swiftly. Let's collaborate and drive greater impact together! Join us here: Discord Link.

## Our Dedicated Team (Alphabetical Order)

| Research | Engineering | Product Management | UX Design |
|-----------------|-----------------|--------------------|---------------|
| Myeongho Jeong | Geon Kim | Bokyung Huh | Eunsue Choi |
| Seungduk Kim | Rifqi Alfi | | |
| Seungtaek Choi | Sanghoon Han | | |
| | Suhyun Kang | | |

This model is a Korean vocabulary-extended version of microsoft/phi-2, fine-tuned on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for the new tokens and partially fine-tuning the `lm_head` embeddings for the already existing tokens, while preserving the original parameters of the base model.

To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing. This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean. Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training. We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.

For more details, please refer to our technical report: Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models.
Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.

Our model's training was comprehensive and diverse:

- Vocabulary Expansion: We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for the model.
  1. Initial Tokenizer Training: We trained an intermediate tokenizer on a Korean web corpus with a vocabulary of 40,000 tokens.
  2. Extraction of New Korean Tokens: From the intermediate tokenizer, we identified all Korean tokens not present in the base model's tokenizer.
  3. Manual Tokenizer Construction: We then built the target tokenizer, focusing on these new Korean tokens.
  4. Frequency Analysis: Using the target tokenizer, we processed a 100 GB Korean corpus to count each token's frequency.
  5. Refinement of Token List: We removed tokens appearing fewer than 6,000 times, keeping enough token occurrences to train the model later.
  6. Inclusion of Single-Letter Characters: We counted the Korean single-letter characters still missing from the target tokenizer and added those that appeared more than 6,000 times.
  7. Iterative Refinement: We repeated steps 2 to 6 until there were no more tokens to drop or add.
  8. Training Bias Towards New Tokens: We biased the training data toward texts containing the new tokens, for more effective learning.

This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.
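Steps 4 and 5 above (frequency counting and pruning) can be sketched roughly as follows. The tokenizer callable and corpus iterator are stand-ins; only the counting and threshold logic reflects the procedure described in the card.

```python
from collections import Counter
from typing import Callable, Iterable, List

MIN_FREQUENCY = 6_000  # threshold named in the card

def count_token_frequencies(corpus: Iterable[str],
                            tokenize: Callable[[str], List[str]]) -> Counter:
    """Step 4: count how often each token occurs across the corpus."""
    counts = Counter()
    for document in corpus:
        counts.update(tokenize(document))
    return counts

def prune_rare_tokens(candidates: List[str], counts: Counter) -> List[str]:
    """Step 5: drop candidate tokens seen fewer than MIN_FREQUENCY times."""
    return [t for t in candidates if counts[t] >= MIN_FREQUENCY]
```

In the real pipeline, `tokenize` would be the target tokenizer and `corpus` a streaming reader over the 100 GB Korean corpus; steps 2 to 6 then repeat until the candidate list stabilizes.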
# YanoljaNEXT-Rosetta-12B-2510
This model is a fine-tuned version of `google/gemma-3-12b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-12B-2510`
- Base Model: `google/gemma-3-12b-pt`

This model is a 12-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages equally: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

You can use this model with the `transformers` library. The model outputs the final translation in JSON format when appropriate, or plain text for simple translations.

## Training Data

The translation datasets were synthesized from the FineWeb corpora:

- FineWeb Edu
- FineWeb2

The model was fine-tuned on this synthetic multilingual translation data to optimize performance across the supported language pairs.
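Since the model returns JSON when appropriate and plain text otherwise, a small post-processing helper can normalize both cases. This is an illustrative sketch, not part of the model release; the code-fence stripping is a defensive assumption.

```python
import json
from typing import Union

def parse_translation(raw_output: str) -> Union[dict, list, str]:
    """Return parsed JSON when the model emitted valid JSON,
    otherwise fall back to the raw plain-text translation."""
    text = raw_output.strip()
    # Strip a ```json fence if the model wrapped its answer in one.
    if text.startswith("```"):
        text = text.strip("`")
        text = text[len("json"):] if text.startswith("json") else text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return raw_output.strip()
```

Callers then get a `dict`/`list` for structured translations and a plain `str` for simple ones, without branching on the model's output mode themselves.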
The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| yanolja/YanoljaNEXT-Rosetta-12B-2510 | 37.36 |
| openai/gpt-4o | 36.08 |
| google/gemini-2.5-flash | 35.25 |
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| google/gemma-3-27b-it | 30.05 |
| google/gemma-3-12b-pt | 29.31 |

YanoljaNEXT-Rosetta-12B-2510 achieves competitive translation quality while maintaining the efficiency of a 12B-parameter model. Scores for the other language pairs can be found in the WMT24++ Evaluation Results.

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model is primarily optimized for processing JSON data. Its performance on unstructured text or other data formats may vary. In some cases, the model may produce invalid JSON, repetitive output, or inaccurate translations.

## License

This model is released under the Gemma license, inherited from its base model, `google/gemma-3-12b-pt`. Please consult the official Gemma license terms for detailed usage guidelines.

## Acknowledgments

This work was supported by a Korea Creative Content Agency (KOCCA) grant funded by the Ministry of Culture, Sports and Tourism (MCST) in 2025 (Project Name: Cultivating Masters and Doctoral Experts to Lead Digital-Tech Tourism; Project Number: RS-2024-00442006; Contribution Rate: 100%). This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.
# YanoljaNEXT-EEVE-10.8B
If you're passionate about the field of Large Language Models and wish to exchange knowledge and insights, we warmly invite you to join our Discord server. Note that Korean is the primary language used there. The LLM landscape is evolving rapidly, and without active sharing, our collective knowledge risks becoming outdated swiftly. Let's collaborate and drive greater impact together! Join us here: Discord Link.

## Our Dedicated Team (Alphabetical Order)

| Research | Engineering | Product Management | UX Design |
|-----------------|-----------------|--------------------|---------------|
| Myeongho Jeong | Geon Kim | Bokyung Huh | Eunsue Choi |
| Seungduk Kim | Rifqi Alfi | | |
| Seungtaek Choi | Sanghoon Han | | |
| | Suhyun Kang | | |

This model is a Korean vocabulary-extended version of upstage/SOLAR-10.7B-v1.0, fine-tuned on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for the new tokens and partially fine-tuning the `lm_head` embeddings for the already existing tokens, while preserving the original parameters of the base model.

To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing. This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean. Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training. We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.

For more details, please refer to our technical report: Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models.
Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.

Our model's training was comprehensive and diverse:

- Vocabulary Expansion: We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for the model.
  1. Initial Tokenizer Training: We trained an intermediate tokenizer on a Korean web corpus with a vocabulary of 40,000 tokens.
  2. Extraction of New Korean Tokens: From the intermediate tokenizer, we identified all Korean tokens not present in the original SOLAR tokenizer.
  3. Manual Tokenizer Construction: We then built the target tokenizer, focusing on these new Korean tokens.
  4. Frequency Analysis: Using the target tokenizer, we processed a 100 GB Korean corpus to count each token's frequency.
  5. Refinement of Token List: We removed tokens appearing fewer than 6,000 times, keeping enough token occurrences to train the model later.
  6. Inclusion of Single-Letter Characters: We counted the Korean single-letter characters still missing from the target tokenizer and added those that appeared more than 6,000 times.
  7. Iterative Refinement: We repeated steps 2 to 6 until there were no more tokens to drop or add.
  8. Training Bias Towards New Tokens: We biased the training data toward texts containing the new tokens, for more effective learning.

This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.
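Step 6 above (filling in missing single-letter Korean characters) can be sketched like this; the vocabulary set and frequency table are stand-ins for the real tokenizer vocabulary and corpus counts.

```python
from collections import Counter
from typing import List, Set

MIN_FREQUENCY = 6_000  # threshold named in the card

def missing_hangul_singles(vocab: Set[str], counts: Counter) -> List[str]:
    """Find single Hangul syllables (U+AC00..U+D7A3) absent from the
    vocabulary but frequent enough in the corpus to be worth adding."""
    singles = (chr(cp) for cp in range(0xAC00, 0xD7A4))
    return [ch for ch in singles
            if ch not in vocab and counts[ch] > MIN_FREQUENCY]
```

Each character this returns would be appended to the target tokenizer before the next refinement pass (step 7).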
# YanoljaNEXT-Rosetta-4B
This model is a fine-tuned version of `google/gemma-3-4b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-4B`
- Base Model: `google/gemma-3-4b-pt`

This model is a 4-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering English, Spanish, French, German, Portuguese, Japanese, Korean, Chinese, Arabic, Russian, and Hindi. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

You can use this model with the `transformers` library. The model outputs the final translation in JSON format when appropriate, or plain text for simple translations.

## Training Data

The translation datasets were compiled from several sources, including:

- AI Hub
- Europarl

The model was fine-tuned on multilingual translation data to optimize performance across the supported language pairs.
The language distribution of the training data is as follows:

| Language | Portion (%) | Language | Portion (%) |
|----------|-------------|----------|-------------|
| Korean | 24.2 | French | 2.8 |
| English | 16.2 | German | 2.5 |
| Japanese | 5.8 | Russian | 2.4 |
| Italian | 5.3 | Arabic | 2.3 |
| Chinese | 4.4 | Other | 30.2 |
| Spanish | 3.9 | | |

The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| yanolja/YanoljaNEXT-Rosetta-12B | 34.75 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| yanolja/YanoljaNEXT-Rosetta-4B | 31.31 |
| openai/gpt-4.1-nano | 31.15 |
| Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 31.02 |
| openai/gpt-oss-20b | 30.56 |
| google/gemma-3-27b-it | 30.05 |
| google/gemma-3-4b-pt | 27.53 |

YanoljaNEXT-Rosetta-4B achieves competitive translation quality while maintaining the efficiency of a 4B-parameter model.

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model's primary focus is JSON data. Performance on unstructured text or other data formats may vary.

## License

This model is released under the Gemma license, inherited from its base model, `google/gemma-3-4b-pt`. Please consult the official Gemma license terms for detailed usage guidelines.

This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.
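Because the card stresses structure preservation, a quick check that a translated payload kept the source's shape can be useful in a pipeline. This helper is an illustrative sketch, not part of the model release:

```python
def same_structure(source, translated) -> bool:
    """True when `translated` has the same shape as `source`: matching dict
    keys and list lengths, recursively. Leaf values (the translated strings)
    are allowed to differ."""
    if isinstance(source, dict):
        return (isinstance(translated, dict)
                and source.keys() == translated.keys()
                and all(same_structure(source[k], translated[k])
                        for k in source))
    if isinstance(source, list):
        return (isinstance(translated, list)
                and len(source) == len(translated)
                and all(same_structure(s, t)
                        for s, t in zip(source, translated)))
    return True  # leaf values may differ after translation
```

A caller could retry the translation (or fall back to a larger model) whenever this check fails.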
# YanoljaNEXT-Rosetta-27B-2511
This model is a fine-tuned version of `google/gemma-3-27b-pt`. As it is intended solely for text generation, we extracted and utilized only the `Gemma3ForCausalLM` component from the original architecture. Unlike our previous EEVE models, this model does not feature an expanded tokenizer.

- Model Name: `yanolja/YanoljaNEXT-Rosetta-27B-2511`
- Base Model: `google/gemma-3-27b-pt`

This model is a 27-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON, YAML, and XML formats) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages equally: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. While optimized for these languages, it may also perform well on other languages supported by the base Gemma3 model.

You can use this model with the `transformers` library. The model outputs the final translation in the same structured format as the input (JSON, YAML, or XML) when appropriate, or plain text for simple translations.

## Training Data

The translation datasets were synthesized from the FineWeb corpora:

- FineWeb Edu
- FineWeb2

The model was fine-tuned on this synthetic multilingual translation data to optimize performance across the supported language pairs.
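Since this model accepts JSON, YAML, and XML and answers in the input's format, a caller may want to know which parser to hand the output to. The sniffing heuristic below is a naive illustrative sketch (not part of the model release) and will misclassify edge cases such as YAML documents that begin with `{`:

```python
def detect_format(text: str) -> str:
    """Guess whether a structured payload is XML, JSON, or YAML.

    A deliberately simple heuristic: XML starts with '<', JSON objects and
    arrays start with '{' or '['; anything else is treated as YAML.
    """
    s = text.lstrip()
    if s.startswith("<"):
        return "xml"
    if s.startswith("{") or s.startswith("["):
        return "json"
    return "yaml"
```

With the format known, the output can be routed to `json.loads`, a YAML parser, or `xml.etree.ElementTree` for the structure check.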
The following chrF++ scores on WMT24++ demonstrate the model's competitive performance against other state-of-the-art translation models on English-to-Korean translation:

| Model | chrF++ (WMT24++) |
|------------------------------------|--------------|
| yanolja/YanoljaNEXT-Rosetta-27B-2511 | 37.21 |
| yanolja/YanoljaNEXT-Rosetta-4B-2511 | 35.64 |
| google/gemini-2.5-flash-lite | 35.23 |
| yanolja/YanoljaNEXT-Rosetta-20B | 33.87 |
| google/gemini-2.0-flash-001 | 33.81 |
| openai/gpt-oss-120b | 31.51 |
| openai/gpt-4.1-nano | 31.15 |
| Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 31.02 |
| openai/gpt-oss-20b | 30.56 |
| google/gemma-3-27b-it | 30.05 |

YanoljaNEXT-Rosetta-27B-2511 achieves strong translation quality while maintaining efficient inference for its parameter size. Scores for the other language pairs can be found in the WMT24++ Evaluation Results.

This model is intended for translating structured data (JSON, YAML, and XML formats) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling other structured content that requires accurate translation.

## Limitations

The model is primarily optimized for processing structured data (JSON, YAML, XML). Its performance on unstructured text or other data formats may vary. In some cases, the model may produce invalid structured output, repetitive output, or inaccurate translations.

## License

This model is released under the Gemma license, inherited from its base architecture. Please consult the official Gemma license terms for detailed usage guidelines.

## Acknowledgments

This work was supported by a Korea Creative Content Agency (KOCCA) grant funded by the Ministry of Culture, Sports and Tourism (MCST) in 2025 (Project Name: Cultivating Masters and Doctoral Experts to Lead Digital-Tech Tourism; Project Number: RS-2024-00442006; Contribution Rate: 100%). This work utilizes several models and datasets.
We would like to acknowledge the original authors for their valuable contributions to the field.