maritaca-ai
Sabia 7b
Sabiá-7B is a Portuguese language model developed by Maritaca AI.

Model Architecture: Sabiá-7B is an auto-regressive language model that uses the same architecture as LLaMA-1-7B.

Tokenizer: It uses the same tokenizer as LLaMA-1-7B.

Pretraining data: The model was pretrained on 7 billion tokens from the Portuguese subset of ClueWeb22. Training started from the weights of LLaMA-1-7B and continued for an additional 10 billion tokens, which is approximately 1.4 epochs of the training dataset.

Data Freshness: The pretraining data has a cutoff of mid-2022.

Paper: For more details, please refer to our paper: Sabiá: Portuguese Large Language Models

Given that Sabiá-7B was trained solely on a language modeling objective, without fine-tuning for instruction following, it is recommended for few-shot tasks rather than zero-shot tasks.

If your GPU does not have enough memory, try loading the model in int8 precision. However, expect some degradation in output quality compared to fp16 or bf16.

Below we show the results on the Poeta benchmark, which consists of 14 Portuguese datasets. For more information on the Normalized Preferred Metric (NPM), please refer to our paper.

|Model | NPM |
|--|--|
|LLaMA-1-7B| 33.0|
|LLaMA-2-7B| 43.7|
|Sabiá-7B| 48.5|

Below we show the average results on 6 English datasets: PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OpenBookQA.

|Model | NPM |
|--|--|
|LLaMA-1-7B| 50.1|
|Sabiá-7B| 49.0|

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|--------------------------|---------|
|Average |47.09|
|ENEM Challenge (No Images)| 55.07|
|BLUEX (No Images) | 47.71|
|OAB Exams | 41.41|
|Assin2 RTE | 46.68|
|Assin2 STS | 1.89|
|FaQuAD NLI | 58.34|
|HateBR Binary | 61.93|
|PT Hate Speech Binary | 64.13|
|tweetSentBR | 46.64|
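As a quick consistency check on the leaderboard table above, the reported average appears to be the plain arithmetic mean of the nine task scores. The sketch below recomputes it from the values in the table; the dictionary name `SCORES` is only illustrative.

```python
from statistics import mean

# Task scores copied from the Open Portuguese LLM Leaderboard table above.
SCORES = {
    "ENEM Challenge (No Images)": 55.07,
    "BLUEX (No Images)": 47.71,
    "OAB Exams": 41.41,
    "Assin2 RTE": 46.68,
    "Assin2 STS": 1.89,
    "FaQuAD NLI": 58.34,
    "HateBR Binary": 61.93,
    "PT Hate Speech Binary": 64.13,
    "tweetSentBR": 46.64,
}

# Unweighted mean over all nine tasks, rounded to two decimals like the table.
average = round(mean(SCORES.values()), 2)
print(f"Average over {len(SCORES)} tasks: {average}")
```

Note that the very low Assin2 STS score pulls the average down noticeably, so the single number is best read alongside the per-task values.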
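The few-shot recommendation above can be illustrated with a minimal sketch. It assumes the Hugging Face `transformers` library (plus `accelerate` for `device_map="auto"`) and assumes the checkpoint is published under the repository id `maritaca-ai/sabia-7b`. The sentiment task, the helper names `build_few_shot_prompt` and `generate`, and the example reviews are illustrative, not taken from the original card.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format labelled examples plus one unlabelled query as a few-shot prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Resenha: {text}", f"Classe: {label}", ""]
    # The prompt ends with an open "Classe:" slot for the model to complete.
    lines += [f"Resenha: {query}", "Classe:"]
    return "\n".join(lines)


def generate(prompt, model_id="maritaca-ai/sabia-7b", max_new_tokens=10):
    """Complete the prompt with the model (repository id is an assumption)."""
    # Imported lazily so the prompt builder can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",           # requires the accelerate package
        torch_dtype=torch.bfloat16,  # switch to torch.float16 if bf16 is unsupported
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The output includes the prompt tokens, so keep only the newly generated ones.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Instruction: classify the movie review as "positiva" or "negativa".
INSTRUCTION = 'Classifique a resenha de filme como "positiva" ou "negativa".'
EXAMPLES = [
    ("Gostei muito do filme, é o melhor do ano!", "positiva"),
    ("O filme deixa muito a desejar.", "negativa"),
]
PROMPT = build_few_shot_prompt(
    INSTRUCTION, EXAMPLES, "Apesar de longo, valeu o ingresso."
)
print(PROMPT)
```

Calling `generate(PROMPT)` downloads the weights and returns the model's continuation. Because the model is not instruction-tuned, it may keep writing further reviews after the label, so in practice the output is usually cut at the first newline.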
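The int8 suggestion above follows from simple arithmetic: weight storage scales with bytes per parameter, so int8 needs about half the memory of fp16 or bf16. The sketch below estimates weight memory for a nominal 7 billion parameters (an approximation; the exact count is slightly different, and activations and the KV cache add to the total). It also shows one possible way to request 8-bit loading through `transformers` with the `bitsandbytes` backend. The repository id `maritaca-ai/sabia-7b` and the helper names are assumptions.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype):
    """Approximate memory (GiB) needed for the model weights alone."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def load_int8(model_id="maritaca-ai/sabia-7b"):
    """Load the model with 8-bit weights (needs bitsandbytes and a CUDA GPU)."""
    # Imported lazily so the estimate above runs without these packages.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )


NOMINAL_PARAMS = 7_000_000_000  # nominal size, not the exact parameter count
for dtype in ("fp16", "bf16", "int8"):
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.2f} GiB")
```

The estimate covers weights only, so leave headroom on the GPU for activations and generation state when choosing a precision.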