BSC-LT

63 models

ALIA-40b

--- license: apache-2.0 library_name: transformers pipeline_tag: text-generation language: - bg - ca - code - cs - cy - da - de - el - en - es - et - eu - fi - fr - ga - gl - hr - hu - it - lt - lv - mt - nl - nn - \no - oc - pl - pt - ro - ru - sh - sk - sl - sr - sv - uk datasets: - oscar-corpus/colossal-oscar-1.0 - HuggingFaceFW/fineweb-edu - joelniklaus/eurlex_resources - joelniklaus/legal-mc4 - projecte-aina/CATalog - UFRGS/brwac - community-datasets/hrwac - danish-foundation-models/danish-

llama
333,843
85

salamandraTA-7b-instruct

SalamandraTA-7b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-7b-base. The base model results from continually pre-training Salamandra-7b on parallel data; it has not been published and is reserved for internal use. SalamandraTA-7b-instruct is proficient in 35 European languages (plus 3 varieties) and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, document-level translation, automatic post-editing, grammar checking, machine translation evaluation, alternative translations, named-entity recognition and context-aware translation.

> [!WARNING]
> DISCLAIMER: This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.

SalamandraTA-7b-base is a continual pre-training of Salamandra-7b using parallel data, resulting in a total of 424B tokens processed during training.

| | |
|-------------------------|:--------------|
| Total Parameters | 7,768,117,248 |
| Embedding Parameters | 1,048,576,000 |
| Layers | 32 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| Context length | 8,192 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Embedding type | RoPE |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | ✅ |
| Grouped Query Attention | ✅ |
| Num. query groups | 8 |

The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks. The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

SalamandraTA-7b-base was continually pre-trained using NVIDIA's NeMo Framework, which leverages PyTorch Lightning for efficient model training in highly distributed settings. SalamandraTA-7b-instruct was produced with FastChat. All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center. The accelerated partition is composed of 1,120 nodes with the following specifications:

- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz with 32 cores each (64 cores total)
- 4x NDR200 (800Gb/s bandwidth per node)
- 512GB of main memory (DDR5)
- 460GB of NVMe storage

You can translate between the following 35 languages (and 3 varieties): Aragonese, Asturian, Basque, Bulgarian, Catalan (and Catalan-Valencian variety), Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian (Bokmål and Nynorsk varieties), Occitan (and Aranese variety), Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, Welsh.

The instruction-following model uses the commonly adopted ChatML template. The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet. Using this template, each turn is preceded by the `<|im_start|>` delimiter and the role of the entity (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
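Below is a minimal sketch of that usage with `transformers`, assuming the repo id `BSC-LT/salamandraTA-7b-instruct` and an illustrative sentence-level translation instruction (the exact prompt wording used during training may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-7b-instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative translation instruction; the training-time template may differ.
prompt = (
    "Translate the following text from English into Catalan.\n"
    "English: The weather is nice today.\n"
    "Catalan:"
)

# apply_chat_template wraps the message in the ChatML
# <|im_start|>...<|im_end|> delimiters described above.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Beam search with beam size 5 matches the evaluation setting reported below.
output = model.generate(input_ids, max_new_tokens=256, num_beams=5)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```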
Dedicated prompt templates are used for machine translation, automatic post-editing, document-level translation, named-entity recognition, and grammar-correction tasks.

The pretraining corpus consists of 424 billion tokens of Catalan-centric, Spanish-centric, and English-centric parallel data, including all of the official European languages plus Catalan, Basque, Galician, Asturian, Aragonese and Aranese. It amounts to 6,574,251,526 parallel sentence pairs. This highly multilingual corpus is predominantly composed of data sourced from OPUS, with additional data taken from the NTEU Project, Aina Project, and other sources (see: Data Sources and References). Where little parallel Catalan-xx data could be found, synthetic Catalan data was generated from the Spanish side of the collected Spanish-xx corpora using Projecte Aina's Spanish-Catalan model. The final distribution of languages was as follows, and the full list of corpora included in the training data is given below.

| Dataset | Ca-xx Languages | Es-xx Languages | En-xx Languages |
|-----------------------------------------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------|
| AINA | en | | |
| ARANESE-SYNTH-CORPUS-BSC | arn | | |
| BOUA-SYNTH-BSC | | val | |
| BOUMH | | val | |
| BOUA-PILAR | | val | |
| CCMatrix | eu | | ga |
| DGT | | bg,cs,da,de,el,et,fi,fr,ga,hr,hu,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,hu,lt,lv,mt,sh,sl |
| DOGV-SYNTH-BSC | | val | |
| DOGV-PILAR | | val | |
| ELRC-EMEA | | bg,cs,da,hu,lt,lv,mt,pl,ro,sk,sl | et,hr,lv,ro,sk,sl |
| EMEA | | bg,cs,da,el,fi,hu,lt,mt,nl,pl,ro,sk,sl,sv | et,mt |
| EUBookshop | lt,pl,pt | cs,da,de,el,fi,fr,ga,it,lv,mt,nl,pl,pt,ro,sk,sl,sv | cy,ga |
| Europarl | | bg,cs,da,el,en,fi,fr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv | |
| Europat | | en,hr | no |
| GAITU Corpus | | | eu |
| KDE4 | bg,cs,da,de,el,et,eu,fi,fr,ga,gl,hr,it,lt,lv,nl,pl,pt,ro,sk,sl,sv | bg,ga,hr | cy,ga,nn,oc |
| GlobalVoices | bg,de,fr,it,nl,pl,pt | bg,de,fr,pt | |
| GNOME | eu,fr,ga,gl,pt | ga | cy,ga,nn |
| JRC-Arquis | | cs,da,et,fr,lt,lv,mt,nl,pl,ro,sv | et |
| LES-CORTS-VALENCIANES-SYNTH-BSC | | val | |
| MaCoCu | en | | hr,mt,uk |
| MultiCCAligned | bg,cs,de,el,et,fi,fr,hr,hu,it,lt,lv,nl,pl,ro,sk,sv | bg,fi,fr,hr,it,lv,nl,pt | bg,cy,da,et,fi,hr,hu,lt,lv,no,sl,sr,uk |
| MultiHPLT | en,et,fi,ga,hr,mt | | fi,ga,gl,hr,mt,nn,sr |
| MultiParaCrawl | bg,da | de,en,fr,ga,hr,hu,it,mt,pt | bg,cs,da,de,el,et,fi,fr,ga,hr,hu,lt,lv,mt,nn,pl,ro,sk,sl,uk |
| MultiUN | | fr | |
| News-Commentary | | fr | |
| NLLB | bg,da,el,en,et,fi,fr,gl,hu,it,lt,lv,pt,ro,sk,sl | bg,cs,da,de,el,et,fi,fr,hu,it,lt,lv,nl,pl,pt,ro,sk,sl,sv | bg,cs,cy,da,de,el,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,no,oc,pl,pt,ro,ru,sk,sl,sr,sv,uk |
| NÓS Authentic Corpus | | | gl |
| NÓS Synthetic Corpus | | | gl |
| NTEU | | bg,cs,da,de,el,en,et,fi,fr,ga,hr,hu,it,lt,lv,mt,nl,pl,pt,ro,sk,sl,sv | da,et,ga,hr,lt,lv,mt,ro,sk,sl,sv |
| OpenSubtitles | bg,cs,da,de,el,et,eu,fi,gl,hr,hu,lt,lv,nl,pl,pt,ro,sk,sl,sv | da,de,fi,fr,hr,hu,it,lv,nl | bg,cs,de,el,et,fi,fr,hr,hu,no,sl,sr |
| OPUS-100 | en | | gl |
| StanfordNLP-NMT | | | cs |
| Tatoeba | de,pt | pt | |
| TildeModel | | bg | et,hr,lt,lv,mt |
| UNPC | | en,fr | ru |
| PILAR-VALENCIAN-AUTH | | val | |
| PILAR-VALENCIAN-SYNTH | | val | |
| WikiMatrix | bg,cs,da,de,el,et,eu,fi,fr,gl,hr,hu,it,lt,nl,pl,pt,ro,sk,sl,sv | bg,en,fr,hr,it,pt | oc,sh |
| Wikimedia | | | cy,nn |
| XLENT | eu,ga,gl | ga | cy,et,ga,gl,hr,oc,sh |

Datasets with "-BSC" in their names (e.g., BOUA-SYNTH-BSC, DOGV-SYNTH-BSC) are synthetic datasets obtained by machine-translating pre-existing monolingual corpora with our own sequence-to-sequence models. These datasets were generated internally for model training and are not published. To consult the data summary document with the respective licences, please send an e-mail to [email protected].

- Aulamo, M., Sulubacak, U., Virpioja, S., & Tiedemann, J. (2020). OpusTools and Parallel Corpus Diagnostics. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3782–3789). European Language Resources Association. https://aclanthology.org/2020.lrec-1.467
- Chaudhary, V., Tang, Y., Guzmán, F., Schwenk, H., & Koehn, P. (2019). Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings. In O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, A. Martins, C. Monz, M. Negri, A. Névéol, M. Neves, M. Post, M. Turchi, & K. Verspoor (Eds.), Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2) (pp. 261–266). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-5435
- DGT-Translation Memory - European Commission. (n.d.). Retrieved November 4, 2024, from https://joint-research-centre.ec.europa.eu/language-technology-resources/dgt-translation-memory_en
- Eisele, A., & Chen, Y. (2010). MultiUN: A Multilingual Corpus from United Nation Documents. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2010/pdf/686_Paper.pdf
- El-Kishky, A., Chaudhary, V., Guzmán, F., & Koehn, P. (2020). CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5960–5969. https://doi.org/10.18653/v1/2020.emnlp-main.480
- El-Kishky, A., Renduchintala, A., Cross, J., Guzmán, F., & Koehn, P. (2021). XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 10424–10430. https://doi.org/10.18653/v1/2021.emnlp-main.814
- Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M., & Joulin, A. (2020). Beyond English-Centric Multilingual Machine Translation (No. arXiv:2010.11125). arXiv. https://doi.org/10.48550/arXiv.2010.11125
- García-Martínez, M., Bié, L., Cerdà, A., Estela, A., Herranz, M., Krišlauks, R., Melero, M., O'Dowd, T., O'Gorman, S., Pinnis, M., Stafanovič, A., Superbo, R., & Vasiļevskis, A. (2021). Neural Translation for European Union (NTEU) (pp. 316–334). https://aclanthology.org/2021.mtsummit-up.23
- Gibert, O. de, Nail, G., Arefyev, N., Bañón, M., van der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A New Massive Multilingual Dataset for High-Performance Language Technologies (No. arXiv:2403.14009). arXiv. http://arxiv.org/abs/2403.14009
- Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of Machine Translation Summit X: Papers, 79–86. https://aclanthology.org/2005.mtsummit-papers.11
- Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin, S., Samb, S., Sagot, B., Rivera, C., Rios, A., Papadimitriou, I., Osei, S., Suarez, P. O., … Adeyemi, M. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
- Rozis, R., & Skadiņš, R. (2017). Tilde MODEL - Multilingual Open Data for EU Languages. https://aclanthology.org/W17-0235
- Schwenk, H., Chaudhary, V., Sun, S., Gong, H., & Guzmán, F. (2019). WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (No. arXiv:1907.05791). arXiv. https://doi.org/10.48550/arXiv.1907.05791
- Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944
- Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (n.d.). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf.pdf
- Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. In A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galystan, J. Dhamala, A. Verma, T. Cao, A. Kumar, & R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (pp. 208–220). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.trustnlp-1.18
- Tiedemann, J. (2012). Parallel Data, Tools and Interfaces in OPUS. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
- Ziemski, M., Junczys-Dowmunt, M., & Pouliquen, B. (n.d.). The United Nations Parallel Corpus v1.0. https://aclanthology.org/L16-1561

This model has been fine-tuned on ~135k instructions, primarily targeting machine translation performance for Catalan, English, and Spanish. Additional instruction data for other European and closely related Iberian languages was also included, as it yielded a positive impact on the languages of interest. That said, the performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing. A portion of our fine-tuning data comes directly from, or is sampled from, TowerBlocks. We also created additional datasets for our main languages of interest. While tasks relating to machine translation are included, it is important to note that no chat data was used in the fine-tuning process.
The final distribution of tasks was as follows; the full list of tasks included in the fine-tuning data is given below.

| Task | Source | Languages | Count |
|----------------------------------|------------------------------------------------------------------------------------------|----------------------------------------------------------------|--------|
| Multi-reference Translation | TowerBlocks: Tatoeba Dev (filtered) | mixed | 10000 |
| Paraphrase | TowerBlocks: PAWS-X Dev | mixed | 3521 |
| Named-entity Recognition | AnCora-Ca-NER | ca | 12059 |
| Named-entity Recognition | BasqueGLUE, EusIE | eu | 4304 |
| Named-entity Recognition | SLI NERC Galician Gold Corpus | gl | 6483 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | pt | 854 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | nl | 800 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | es | 1654 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | en | 1671 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | ru | 800 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | it | 858 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | fr | 857 |
| Named-entity Recognition | TowerBlocks: MultiCoNER 2022 and 2023 Dev | de | 1312 |
| Terminology-aware Translation | TowerBlocks: WMT21 Terminology Dev (filtered) | en-ru | 50 |
| Terminology-aware Translation | TowerBlocks: WMT21 Terminology Dev (filtered) | en-fr | 29 |
| Automatic Post Editing | TowerBlocks: QT21, ApeQuest | en-fr | 6133 |
| Automatic Post Editing | TowerBlocks: QT21, ApeQuest | en-nl | 9077 |
| Automatic Post Editing | TowerBlocks: QT21, ApeQuest | en-pt | 5762 |
| Automatic Post Editing | TowerBlocks: QT21, ApeQuest | de-en | 10000 |
| Automatic Post Editing | TowerBlocks: QT21, ApeQuest | en-de | 10000 |
| Machine Translation Evaluation | TowerBlocks-sample: WMT20 to WMT22 Metrics MQM, WMT17 to WMT22 Metrics Direct Assessments | en-ru, en-pl, ru-en, en-de, en-ru, de-fr, de-en, en-de | 353 |
| Machine Translation Evaluation | Non-public | four pivot languages (eu, es, ca, gl) paired with European languages (bg, cs, da, de, el, en, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv) | 9700 |
| General Machine Translation | TowerBlocks: WMT14 to WMT21, NTREX, Flores Dev, FRMT, QT21, ApeQuest, OPUS (Quality Filtered), MT-GenEval | nl-en, en-ru, it-en, fr-en, es-en, en-fr, ru-en, fr-de, en-nl, de-fr | 500 |
| General Machine Translation | Non-public | three pivot languages (es, ca, en) paired with European languages (ast, arn, arg, bg, cs, cy, da, de, el, et, fi, ga, gl, hr, it, lt, lv, mt, nb, nn, nl, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk, eu) | 9350 |
| Fill-in-the-Blank | Non-public | five pivot languages (ca, es, eu, gl, en) paired with European languages (cs, da, de, el, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv) | 11500 |
| Document-level Translation | Non-public | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
| Paragraph-level Translation | Non-public | two pivot languages (es, en) paired with European languages (bg, cs, da, de, el, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, ru, sk, sv) | 7600 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-it | 348 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-ru | 454 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-fr | 369 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-nl | 417 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-es | 431 |
| Context-Aware Translation | TowerBlocks: MT-GenEval | en-de | 558 |
| Total | | | 135,404 |

The non-public portion of this dataset was jointly created by the ILENIA partners: BSC-LT, HiTZ, and CiTIUS. For further information regarding the instruction-tuning data, please contact .

- Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., Colombo, P., de Souza, J. G. C., & Martins, A. F. T. (2024). Tower: An open multilingual large language model for translation-related tasks (No. arXiv:2402.17733). arXiv. https://arxiv.org/abs/2402.17733
- Armengol-Estapé, J., Carrino, C. P., Rodriguez-Penagos, C., de Gibert Bonet, O., Armentano-Oller, C., Gonzalez-Agirre, A., Melero, M., & Villegas, M. (2021). Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 4933–4946. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.437
- Currey, A., Nadejde, M., Pappagari, R. R., Mayer, M., Lauly, S., Niu, X., Hsu, B., & Dinu, G. (2022). MT-GenEval: A counterfactual and contextual dataset for evaluating gender accuracy in machine translation. In Y. Goldberg, Z. Kozareva, & Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 4287–4299). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.emnlp-main.288
- Federmann, C., Kocmi, T., & Xin, Y. (2022). NTREX-128 – News test references for MT evaluation of 128 languages. Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, 21–24. Association for Computational Linguistics. https://aclanthology.org/2022.sumeval-1.4
- Ive, J., Specia, L., Szoc, S., Vanallemeersch, T., Van den Bogaert, J., Farah, E., Maroti, C., Ventura, A., & Khalilov, M. (2020). A post-editing dataset in the legal domain: Do we underestimate neural machine translation quality? In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3692–3697). European Language Resources Association. https://aclanthology.org/2020.lrec-1.455/
- Malmasi, S., Fang, A., Fetahu, B., Kar, S., & Rokhlenko, O. (2022). MultiCoNER: A large-scale multilingual dataset for complex named entity recognition. Proceedings of the 29th International Conference on Computational Linguistics, 3798–3809. International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.334/
- NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Mejia Gonzalez, G., Hansanti, P., Hoffman, J., Jarrett, S., Sadagopan, K. R., Rowe, D., Spruit, S., Tran, C., Andrews, P., Ayan, N. F., Bhosale, S., Edunov, S., Fan, A., Gao, C., Goswami, V., Guzmán, F., Koehn, P., Mourachko, A., Ropers, C., Saleem, S., Schwenk, H., & Wang, J. (2022). No language left behind: Scaling human-centered machine translation (No. arXiv:2207.04672). arXiv. https://arxiv.org/abs/2207.04672
- Riley, P., Dozat, T., Botha, J. A., Garcia, X., Garrette, D., Riesa, J., Firat, O., & Constant, N. (2022). FRMT: A benchmark for few-shot region-aware machine translation (No. arXiv:2210.00193). arXiv. https://doi.org/10.48550/ARXIV.2210.00193
- Specia, L., Harris, K., Blain, F., Burchardt, A., Macketanz, V., Skadiņa, I., Negri, M., & Turchi, M. (2017). Translation quality and productivity: A study on rich morphology languages. Proceedings of Machine Translation Summit XVI, 55–71. Nagoya, Japan.
- Tiedemann, J. (2020). The Tatoeba translation challenge – Realistic data sets for low-resource and multilingual MT. Proceedings of the Fifth Conference on Machine Translation, 1174–1182. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.wmt-1.139
- Urbizu, G., San Vicente, I., Saralegi, X., Agerri, R., & Soroa, A. (2022). BasqueGLUE: A natural language understanding benchmark for Basque. Proceedings of the Language Resources and Evaluation Conference, 1603–1612. European Language Resources Association. https://aclanthology.org/2022.lrec-1.172
- Yang, Y., Zhang, Y., Tar, C., & Baldridge, J. (2019). PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3687–3692). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1382
- Zubillaga, M., Sainz, O., Estarrona, A., Lopez de Lacalle, O., & Agirre, E. (2024). Event extraction in Basque: Typologically motivated cross-lingual transfer-learning analysis (No. arXiv:2404.06392). arXiv. https://arxiv.org/abs/2404.06392

Below are the evaluation results on the Flores+200 devtest set, compared against the state-of-the-art MADLAD400-7B-mt model (Kudugunta, S., et al.) and the SalamandraTA-7b-base model. These results cover the translation directions CA-XX, ES-XX, EN-XX, as well as XX-CA, XX-ES, and XX-EN. The metrics have been computed excluding Asturian, Aranese, and Aragonese, as these are reported separately. The evaluation was conducted using MT-Lens, following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). We report the following metrics:

- `BLEU`: Sacrebleu implementation. Signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1
- `TER`: Sacrebleu implementation.
- `ChrF`: Sacrebleu implementation.
- `Comet`: Model checkpoint: "Unbabel/wmt22-comet-da".
- `Comet-kiwi`: Model checkpoint: "Unbabel/wmt22-cometkiwi-da".
- `Bleurt`: Model checkpoint: "lucadiliello/BLEURT-20".
- `MetricX`: Model checkpoint: "google/metricx-23-xl-v2p0".
- `MetricX-QE`: Model checkpoint: "google/metricx-23-qe-xl-v2p0".
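For reference, the SacreBLEU-based metrics above can be computed with the `sacrebleu` Python package; the snippet below is a minimal sketch with placeholder hypothesis/reference strings (COMET, BLEURT and MetricX require their own toolkits and the checkpoints listed above):

```python
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["El temps és agradable avui."]  # system outputs (placeholders)
references = [["Avui fa bon temps."]]         # one inner list per reference set

# Settings matching the reported signature: 13a tokenizer, exponential smoothing.
bleu = BLEU(tokenize="13a", smooth_method="exp")
chrf = CHRF()
ter = TER()

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
print(ter.corpus_score(hypotheses, references))
print(bleu.get_signature())  # nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:...
```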
This section presents the evaluation metrics for English translation tasks.

| | Bleu↑ | Ter↓ | ChrF↑ | Comet↑ | Comet-kiwi↑ | Bleurt↑ | MetricX↓ | MetricX-QE↓ |
|:---------------------------------|-------:|------:|-------:|--------:|-------------:|---------:|----------:|-------------:|
| EN-XX | | | | | | | | |
| SalamandraTA-7b-instruct | 35.20 | 53.40 | 61.58 | 0.89 | 0.86 | 0.78 | 0.96 | 0.81 |
| MADLAD400-7B | 35.73 | 51.87 | 63.46 | 0.88 | 0.85 | 0.79 | 1.16 | 1.10 |
| SalamandraTA-7b-base | 34.99 | 52.64 | 62.58 | 0.87 | 0.84 | 0.77 | 1.45 | 1.23 |
| XX-EN | | | | | | | | |
| SalamandraTA-7b-instruct | 44.37 | 42.49 | 68.29 | 0.89 | 0.86 | 0.80 | 1.05 | 0.99 |
| MADLAD400-7B | 43.20 | 43.33 | 67.98 | 0.89 | 0.86 | 0.80 | 1.13 | 1.15 |
| SalamandraTA-7b-base | 44.12 | 43.00 | 68.43 | 0.89 | 0.85 | 0.80 | 1.13 | 1.22 |

This section presents the evaluation metrics for Spanish translation tasks.

| | Bleu↑ | Ter↓ | ChrF↑ | Comet↑ | Comet-kiwi↑ | Bleurt↑ | MetricX↓ | MetricX-QE↓ |
|:---------------------------------|-------:|------:|-------:|--------:|-------------:|---------:|----------:|-------------:|
| ES-XX | | | | | | | | |
| SalamandraTA-7b-instruct | 23.68 | 67.31 | 53.98 | 0.87 | 0.83 | 0.76 | 0.93 | 0.80 |
| MADLAD400-7B | 22.48 | 68.91 | 53.93 | 0.86 | 0.83 | 0.75 | 1.09 | 1.14 |
| SalamandraTA-7b-base | 21.63 | 70.08 | 52.98 | 0.86 | 0.83 | 0.74 | 1.24 | 1.12 |
| XX-ES | | | | | | | | |
| SalamandraTA-7b-instruct | 26.40 | 62.27 | 53.54 | 0.85 | 0.84 | 0.74 | 0.80 | 1.07 |
| MADLAD400-7B | 24.85 | 61.82 | 53.00 | 0.85 | 0.84 | 0.74 | 1.05 | 1.50 |
| SalamandraTA-7b-base | 24.71 | 62.33 | 52.96 | 0.85 | 0.84 | 0.73 | 1.06 | 1.37 |

This section presents the evaluation metrics for Catalan translation tasks.

| | Bleu↑ | Ter↓ | ChrF↑ | Comet↑ | Comet-kiwi↑ | Bleurt↑ | MetricX↓ | MetricX-QE↓ |
|:---------------------------------|-------:|------:|-------:|--------:|-------------:|---------:|----------:|-------------:|
| CA-XX | | | | | | | | |
| SalamandraTA-7b-instruct | 29.50 | 59.26 | 58.21 | 0.88 | 0.81 | 0.77 | 0.97 | 0.98 |
| MADLAD400-7B | 29.37 | 59.01 | 58.47 | 0.87 | 0.81 | 0.77 | 1.08 | 1.31 |
| SalamandraTA-7b-base | 29.06 | 59.32 | 58.00 | 0.87 | 0.81 | 0.76 | 1.23 | 1.28 |
| XX-CA | | | | | | | | |
| SalamandraTA-7b-instruct | 34.51 | 54.21 | 60.10 | 0.86 | 0.81 | 0.76 | 0.90 | 1.29 |
| MADLAD400-7B | 33.02 | 55.01 | 59.38 | 0.86 | 0.81 | 0.75 | 1.18 | 1.79 |
| SalamandraTA-7b-base | 32.75 | 55.78 | 59.42 | 0.86 | 0.81 | 0.75 | 1.17 | 1.63 |

This section presents the evaluation metrics for Galician translation tasks.

| | Bleu↑ | Ter↓ | ChrF↑ | Comet↑ | Comet-kiwi↑ | Bleurt↑ | MetricX↓ | MetricX-QE↓ |
|:---------------------------------|-------:|------:|-------:|--------:|-------------:|---------:|----------:|-------------:|
| GL-XX | | | | | | | | |
| SalamandraTA-7b-instruct | 36.95 | 50.12 | 62.55 | 0.88 | 0.85 | 0.77 | 0.86 | 0.98 |
| MADLAD400-7B | 26.43 | 64.30 | 55.99 | 0.86 | 0.85 | 0.76 | 1.35 | 2.06 |
| SalamandraTA-7b-base | 27.47 | 61.39 | 56.96 | 0.87 | 0.82 | 0.76 | 1.23 | 1.29 |
| XX-GL | | | | | | | | |
| SalamandraTA-7b-instruct | 34.37 | 52.49 | 60.99 | 0.88 | 0.85 | 0.73 | 0.75 | 0.92 |
| MADLAD400-7B | 27.77 | 59.46 | 54.92 | 0.84 | 0.85 | 0.67 | 1.42 | 2.72 |
| SalamandraTA-7b-base | 28.22 | 59.52 | 56.28 | 0.85 | 0.82 | 0.69 | 1.27 | 1.78 |

This section presents the evaluation metrics for Basque translation tasks.
| | Bleu↑ | Ter↓ | ChrF↑ | Comet↑ | Comet-kiwi↑ | Bleurt↑ | MetricX↓ | MetricX-QE↓ |
|:---------------------------------|-------:|------:|-------:|--------:|-------------:|---------:|----------:|-------------:|
| EU-XX | | | | | | | | |
| SalamandraTA-7b-instruct | 29.89 | 58.54 | 56.66 | 0.87 | 0.85 | 0.76 | 0.90 | 0.89 |
| MADLAD400-7B | 21.26 | 69.75 | 49.80 | 0.85 | 0.82 | 0.72 | 1.54 | 2.71 |
| SalamandraTA-7b-base | 22.87 | 67.38 | 52.19 | 0.86 | 0.79 | 0.74 | 1.19 | 1.61 |
| XX-EU | | | | | | | | |
| SalamandraTA-7b-instruct | 18.89 | 71.74 | 57.16 | 0.87 | 0.84 | 0.82 | 0.58 | 0.44 |
| MADLAD400-7B | 13.64 | 85.01 | 50.96 | 0.82 | 0.80 | 0.78 | 2.09 | 3.58 |
| SalamandraTA-7b-base | 17.01 | 75.92 | 55.22 | 0.85 | 0.77 | 0.80 | 1.04 | 1.17 |

The tables below summarize the performance metrics for English, Spanish, and Catalan to Asturian, Aranese and Aragonese, compared against Transducens/IbRo-nllb (Galiano Jimenez, et al.), NLLB-200-3.3B (Costa-jussà et al., 2022) and SalamandraTA-2B.

| | source | target | Bleu ↑ | Ter ↓ | ChrF ↑ |
|:-------------------------|:---------|:---------|:----------|:----------|:----------|
| SalamandraTA-7b-instruct | en | ast | 31.79 | 54.07 | 61.78 |
| SalamandraTA-7b-base | en | ast | 26.40 | 64.02 | 57.35 |
| Transducens/IbRo-nllb | en | ast | 20.56 | 63.92 | 53.32 |
| | | | | | |
| SalamandraTA-7b-instruct | en | arn | 22.77 | 66.06 | 52.61 |
| SalamandraTA-7b-base | en | arn | 14.13 | 74.05 | 46.17 |
| Transducens/IbRo-nllb | en | arn | 12.81 | 73.21 | 45.76 |
| | | | | | |
| SalamandraTA-7b-instruct | en | arg | 19.74 | 71.58 | 51.08 |
| Transducens/IbRo-nllb | en | arg | 14.07 | 70.37 | 46.89 |
| SalamandraTA-7b-base | en | arg | 12.24 | 73.48 | 44.75 |

| | source | target | Bleu ↑ | Ter ↓ | ChrF ↑ |
|:-------------------------|:---------|:---------|:----------|:----------|:----------|
| SalamandraTA-7b-instruct | es | ast | 20.66 | 71.81 | 53.14 |
| SalamandraTA-7b-base | es | ast | 17.65 | 75.78 | 51.05 |
| Transducens/IbRo-nllb | es | ast | 16.79 | 76.36 | 50.89 |
| | | | | | |
| SalamandraTA-7b-base | es | arn | 51.59 | 35.51 | 73.50 |
| Transducens/IbRo-nllb | es | arn | 50.20 | 36.60 | 73.16 |
| SalamandraTA-7b-instruct | es | arn | 47.37 | 39.29 | 70.65 |
| | | | | | |
| Transducens/IbRo-nllb | es | arg | 59.75 | 28.01 | 78.73 |
| SalamandraTA-7b-base | es | arg | 53.96 | 31.51 | 76.08 |
| SalamandraTA-7b-instruct | es | arg | 44.10 | 39.98 | 71.12 |

| | source | target | Bleu ↑ | Ter ↓ | ChrF ↑ |
|:-------------------------|:---------|:---------|:----------|:----------|:----------|
| SalamandraTA-7b-instruct | ca | ast | 28.13 | 58.84 | 58.98 |
| SalamandraTA-7b-base | ca | ast | 26.11 | 63.63 | 58.08 |
| Transducens/IbRo-nllb | ca | ast | 24.77 | 61.60 | 57.49 |
| | | | | | |
| SalamandraTA-7b-base | ca | arn | 31.76 | 53.71 | 60.71 |
| Transducens/IbRo-nllb | ca | arn | 31.22 | 54.30 | 60.30 |
| SalamandraTA-7b-instruct | ca | arn | 30.89 | 54.70 | 59.78 |
| | | | | | |
| Transducens/IbRo-nllb | ca | arg | 24.44 | 60.79 | 55.51 |
| SalamandraTA-7b-base | ca | arg | 22.53 | 62.37 | 54.32 |
| SalamandraTA-7b-instruct | ca | arg | 20.96 | 65.64 | 52.41 |

Below are the evaluation results for gender-aware translation, evaluated on the MT-GenEval dataset (Currey, A. et al.). These have been calculated for translation from English into German, Spanish, French, Italian, Portuguese and Russian, and are compared against MADLAD400-7B-mt, TowerInstruct-7B-v0.2 and the SalamandraTA-7b-base model.
Evaluation was conducted using MT-Lens and is reported as accuracy, computed using the accuracy metric provided with MT-GenEval.

| | Source | Target | Masc | Fem | Pair |
|:--|:--|:--|:--|:--|:--|
| MADLAD400-7B | en | de | 0.877 | 0.823 | 0.713 |
| SalamandraTA-7b-base | en | de | 0.857 | 0.770 | 0.660 |
| SalamandraTA-7b-instruct | en | de | 0.863 | 0.867 | 0.740 |
| TowerInstruct-7B-v0.2 | en | de | 0.863 | 0.840 | 0.727 |
| | | | | | |
| MADLAD400-7B | en | es | 0.887 | 0.780 | 0.687 |
| SalamandraTA-7b-base | en | es | 0.890 | 0.733 | 0.643 |
| SalamandraTA-7b-instruct | en | es | 0.860 | 0.837 | 0.710 |
| TowerInstruct-7B-v0.2 | en | es | 0.850 | 0.823 | 0.693 |
| | | | | | |
| MADLAD400-7B | en | fr | 0.873 | 0.777 | 0.663 |
| SalamandraTA-7b-base | en | fr | 0.887 | 0.710 | 0.617 |
| SalamandraTA-7b-instruct | en | fr | 0.900 | 0.813 | 0.730 |
| TowerInstruct-7B-v0.2 | en | fr | 0.880 | 0.823 | 0.717 |
| | | | | | |
| MADLAD400-7B | en | it | 0.907 | 0.663 | 0.597 |
| SalamandraTA-7b-base | en | it | 0.893 | 0.593 | 0.513 |
| SalamandraTA-7b-instruct | en | it | 0.913 | 0.780 | 0.707 |
| TowerInstruct-7B-v0.2 | en | it | 0.947 | 0.747 | 0.713 |
| | | | | | |
| MADLAD400-7B | en | pt | 0.923 | 0.687 | 0.627 |
| SalamandraTA-7b-base | en | pt | 0.923 | 0.650 | 0.597 |
| SalamandraTA-7b-instruct | en | pt | 0.933 | 0.797 | 0.747 |
| TowerInstruct-7B-v0.2 | en | pt | 0.907 | 0.730 | 0.670 |
| | | | | | |
| MADLAD400-7B | en | ru | 0.940 | 0.797 | 0.740 |
| SalamandraTA-7b-base | en | ru | 0.933 | 0.713 | 0.653 |
| SalamandraTA-7b-instruct | en | ru | 0.950 | 0.830 | 0.783 |
| TowerInstruct-7B-v0.2 | en | ru | 0.933 | 0.797 | 0.733 |

Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found in the Salamandra-7B model card. With regard to MT models, the only bias-related analysis we have conducted is the MT-GenEval evaluation. No specific analysis has yet been carried out to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in machine translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework MT-Lens. Note that the model has only undergone preliminary instruction tuning. We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.

Author
The Language Technologies Unit from Barcelona Supercomputing Center.

Contact
For further information, please send an email to .

Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

Funding
This work has been promoted and financed by the Government of Catalonia through the Aina Project. This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the ILENIA Project with reference 2022/TL22/00215337. The success of this project has been made possible thanks to the invaluable contributions of our partners in the ILENIA Project: HiTZ and CiTIUS. Their efforts have been instrumental in advancing our work, and we sincerely appreciate their help and support.

Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use. If you find our model useful, we would appreciate it if you could cite our work.

llama
100,439
21

ALIA-40b-instruct-2601

llama
1,922
3

salamandra-7b-vision

1,761
3

MrBERT

license:apache-2.0
1,497
7

MrBERT-es

license:apache-2.0
1,316
4

salamandra-2b-instruct

llama
1,296
25

whisper-large-v3-ca-punctuated-3370h

license:apache-2.0
1,242
0

salamandraTA-2b-instruct

llama
1,188
1

salamandra-2b

llama
907
23

MRoBERTa

license:apache-2.0
581
5

vocos-mel-22khz

license:apache-2.0
573
6

ALIA-40b-instruct-2512

llama
485
0

salamandra-7b-instruct-tools-16k

llama
408
0

salamandraTA-7B-instruct-GGUF

llama
363
1

salamandra-7b-instruct-tools

llama
345
3

ALIA-40b-instruct

llama
280
1

ALIA-40b-instruct-2601-GGUF

license:apache-2.0
249
2

ALIA-40b-instruct-2512-GGUF

license:apache-2.0
225
2

salamandraTA-2B-instruct-GGUF

This model is the GGUF-quantized version of SalamandraTA-2b-instruct. The model weights are quantized from FP16 to Q8_0 (8-bit quantization), Q4_K_M (4-bit weights with K-means clustering quantization) and Q3_K_M (3-bit weights with K-means clustering quantization) using the Llama.cpp framework. Inference with this model can be done using vLLM.

SalamandraTA-2b-instruct is a translation LLM that has been instruction-tuned from SalamandraTA-2b-base. The base model results from continually pre-training Salamandra-2b on parallel data; it has not been published and is reserved for internal use. SalamandraTA-2b-instruct is proficient in 35 European languages (plus 3 varieties) and supports translation-related tasks, namely: sentence-level translation, paragraph-level translation, automatic post-editing, grammar checking, machine translation evaluation, alternative translations, named-entity recognition and context-aware translation.

> [!WARNING]
> DISCLAIMER: This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions.

The entire Salamandra family is released under a permissive Apache 2.0 license. The following example code works under ``Python 3.10.4``, ``vllm==0.7.3``, ``torch==2.5.1`` and ``torchvision==0.20.1``, though it should run on any current version of the libraries. This is an example of translation using the model (see the sketch after this card).

Author
The Language Technologies Unit from Barcelona Supercomputing Center.

Contact
For further information, please send an email to .

Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

Funding
This work has been promoted and financed by the Government of Catalonia through the Aina Project. This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the ILENIA Project with reference 2022/TL22/00215337. The success of this project has been made possible thanks to the invaluable contributions of our partners in the ILENIA Project: HiTZ and CiTIUS. Their efforts have been instrumental in advancing our work, and we sincerely appreciate their help and support.

Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
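A minimal sketch of the vLLM translation example referenced above; the GGUF file name is an assumption (check the repository's file listing for the actual artifact names), and the prompt wording is illustrative:

```python
from vllm import LLM, SamplingParams

# Load a local GGUF file; vLLM needs the original tokenizer alongside it.
llm = LLM(
    model="salamandrata-2b-instruct-q4_k_m.gguf",  # assumed local file name
    tokenizer="BSC-LT/salamandraTA-2b-instruct",   # assumed tokenizer repo id
)

# Illustrative translation instruction; the training-time template may differ.
prompt = (
    "Translate the following text from Spanish into English.\n"
    "Spanish: Hoy hace buen tiempo.\n"
    "English:"
)

# llm.chat applies the model's chat template to the message list.
outputs = llm.chat(
    [{"role": "user", "content": prompt}],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```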

llama
163
2

RoBERTa-ca

license:apache-2.0
126
2

MrBERT-biomed

license:apache-2.0
118
0

whisper-bsc-large-v3-cat

license:apache-2.0
115
1

ALIA-40b-instruct_Q8_0

license:apache-2.0
98
0

hubert-base-los-2k

- Model Description
- Intended Uses and Limitations
- Pre-training Details
- Indirect Evaluation Results
- How to Use the Model
- Citation
- Additional Information

This is a HuBERT Base model pre-trained using 2,000 hours of Iberian-language speech data (Spanish, Catalan, Basque, and Galician). The model architecture is the same as the original HuBERT Base model, which contains 12 transformer layers. Pre-training was done by the Barcelona Supercomputing Center.

This pre-trained model generates speech representations that can be used for any Iberian speech-related task. This model does not have a tokenizer, as it was pre-trained on audio alone. In order to use this model for Automatic Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model for Speech Recognition. For an explanation of how to fine-tune the model for Audio Classification, check out this tutorial.

This model was pre-trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper. For pre-training, a 2,000-hour dataset was created using subsets of the training splits from the following datasets:

| Dataset | Language | Selected hours | Comments |
|---------|----------|----------------|----------|
| Basque Parliament Speech Corpus 1.0 | Spanish | 191 | |
| VoxPopuli | Spanish | 152 | |
| CommonVoice 21 | Spanish | 120 | |
| VoxForge Spanish | Spanish | 37 | |
| Catalan Youtube Speech | Catalan | 170 | |
| 3CatParla | Catalan | 170 | This dataset is private and is planned to be made public soon. |
| CommonVoice 21 | Catalan | 44 | |
| Corts Valencianes | Catalan | 44 | Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version. |
| parlament_parla_v3 | Catalan | 44 | This dataset is private and is planned to be made public soon. |
| IB3 - Speech Corpus for Catalan-varieties ASR | Catalan | 28 | This dataset is private and is planned to be made public soon. |
| Basque Parliament Speech Corpus 1.0 | Basque | 334 | |
| CommonVoice 21 | Basque | 166 | |
| NosRG-Podcast-GL | Galician | 250 | |
| NosParlaSpeech-GL | Galician | 100 | |
| CommonVoice 21 | Galician | 90 | |
| NosTranscrispeech-GL | Galician | 35 | |
| NosCeltia-GL | Galician | 25 | |

To assess the quality of the pre-trained speech representations, we evaluated them using two indirect tasks: Automatic Speech Recognition (ASR) and Language Identification (LID). We created train and validation ASR-labelled datasets using a 400-hour subsample from the pre-training dataset split. For testing, we created a test split concatenating all the test splits from:

- CommonVoice 21
- Basque Parliament Speech Corpus 1.0
- VoxPopuli
- 3CatParla
- Corts Valencianes
- parlament_parla_v3
- Catalan Youtube Speech
- NosParlaSpeech-GL
- NosRG-Podcast-GL
- NosTranscrispeech-GL

We fine-tuned the following models on this 400-hour ASR-labelled training split:

- Iberian pre-trained HuBERT: BSC-LT/hubert-base-los-2k (our model)
- English pre-trained HuBERT: facebook/hubert-base-ls960
- Multilingual pre-trained HuBERT: utter-project/mHuBERT-147

All of these models were pre-trained using exactly the same configurations. We trained them for 20 epochs. For the fine-tuning process, we froze the models' parameters using the `freeze_feature_encoder()` method.
hubert-base-los-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 95% were fine-tuned. The results were the following:

| Model | Train WER | Validation WER | Test WER ↓ |
|-------------------------|--------|-------|-------|
| hubert-base-los-2k | 5.6% | 8.5% | 12.8% |
| mHuBERT-147 | 8.1% | 11.2% | 15.9% |
| hubert-base-ls960 | 11.6% | 15.2% | 20.7% |

We created train and validation Language Identification labelled datasets using a 200-hour subsample from the pre-training dataset split (excluding Common Voice splits). For testing, we created a test split concatenating all the Spanish, Catalan, Basque, and Galician test splits from CommonVoice 21. We fine-tuned the following models on this 200-hour labelled training split:

- Iberian pre-trained HuBERT: BSC-LT/hubert-base-los-2k (our model)
- English pre-trained HuBERT: facebook/hubert-base-ls960
- Multilingual pre-trained HuBERT: utter-project/mHuBERT-147

All of these models were pre-trained using exactly the same configurations. We trained them for 10 epochs. For the fine-tuning process, we froze the models' parameters using the `freeze_base_model()` method. hubert-base-los-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 0.2% were fine-tuned.

| Model | Train f1-macro | Validation f1-macro | Test f1-macro ↑ |
|------------------------|--------|-------|-------|
| hubert-base-los-2k | 99.6% | 99.6% | 69.2% |
| mHuBERT-147 | 97.2% | 97.6% | 37.6% |
| hubert-base-ls960 | 91.6% | 92.1% | 20.3% |

To obtain speech representations (HuBERT outputs) from audio in Iberian languages using this model, you can follow the example given after this card (using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended). Important remark: the k-means model available in this repo and used for extracting discrete speech representations was trained using HuBERT's 6th layer. To obtain discrete speech representations (HuBERT's k-means centroids), features should therefore be extracted from that layer before applying the k-means model.

In order to use this model for Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model for Speech Recognition. For an explanation of how to fine-tune the model for Audio Classification, check out this tutorial.

If this model contributes to your research, please cite the work. The pre-training process was performed during 2025 in the Language Technologies Unit of the Barcelona Supercomputing Center.

Contact
For further information, please send an email to .

Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. Be aware that the model may have biases and/or any other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence. In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.
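A minimal sketch of the feature-extraction example referenced in the card above, assuming the repo id `BSC-LT/hubert-base-los-2k` and 16 kHz mono input:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

model_id = "BSC-LT/hubert-base-los-2k"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)

# Stand-in for one second of real 16 kHz audio (e.g., loaded with `datasets`).
speech = np.random.randn(16000).astype("float32")
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

representations = outputs.last_hidden_state  # (1, frames, 768) continuous features
# Per the card, the k-means model for discrete representations was trained
# on HuBERT's 6th layer, so discrete units should be computed from this tensor.
layer6_features = outputs.hidden_states[6]
print(representations.shape, layer6_features.shape)
```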

license:apache-2.0
88
0

faster-whisper-large-v3-ca-punctuated-3370h

license:apache-2.0
71
0

roberta-base-ca

license:apache-2.0
69
2

roberta-base-biomedical-clinical-es

license:apache-2.0
42
7

salamandra-7b-instruct-gptq

llama
42
0

salamandra-TAV-7b

llama
37
1

MrBERT-legal

license:apache-2.0
31
0

MrBERT-ca

license:apache-2.0
27
1

salamandraTA-7B-academic

This repository contains the model SalamandraTA-7B-academic, a machine translation fine-tuning of Salamandra-7B-Instruct. This model was obtained following the procedures described in ACADATA: Parallel Dataset of Academic Data for Machine Translation.

> [!WARNING]
> DISCLAIMER: This version of Salamandra is tailored exclusively for translation tasks. Although this machine translation version was obtained by fine-tuning an instructed model, its chat capabilities have not been tested; for chat use, refer to the original instructed version.

| | |
|-------------------------|:--------------|
| Total Parameters | 7,768,117,248 |
| Embedding Parameters | 1,048,576,000 |
| Layers | 32 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| Context length | 8,192 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Embedding type | RoPE |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | ✅ |
| Grouped Query Attention | ✅ |
| Num. query groups | 8 |

The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks. The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

SalamandraTA-7B-academic was instruction-tuned with FastChat. All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center. The accelerated partition is composed of 1,120 nodes with the following specifications:

- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz with 32 cores each (64 cores total)
- 4x NDR200 (800Gb/s bandwidth per node)
- 512GB of main memory (DDR5)
- 460GB of NVMe storage

SalamandraTA-7B-academic was fine-tuned using the ACAD-Train dataset, which focuses on pairs involving English, Iberian Peninsula languages, and several Central European languages, namely: Asturian (ast), Catalan (ca), German (de), Greek (el), Spanish (es), English (en), Basque (eu), French (fr), Galician (gl), Italian (it), Dutch (nl) and Portuguese (pt). The dataset includes 48 unique language pairs. Since each pair is used for translation in both directions (e.g., English to Spanish and Spanish to English), this results in the 96 total supported directions. The most frequent language pairs, accounting for 96.5% of the dataset, are:

- English - Spanish (en-es)
- English - French (en-fr)
- English - Catalan (en-ca)
- Catalan - Spanish (ca-es)
- Spanish - French (es-fr)
- English - Portuguese (en-pt)

A comprehensive list of all language pairs is included in the ACAD-Train dataset. The instruction-following model uses the commonly adopted ChatML template. The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the snippet below. Using this template, each turn is preceded by the `<|im_start|>` delimiter and the role of the entity (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token. The prompt template used during training is the recommended one. The corpus used for the instruction tuning is ACAData. For more details about the corpus construction, you can refer to the paper.
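As a sketch of this usage (the repo id and prompt wording are assumptions; the exact training-time template may differ), a chat-formatted request can be passed through the `transformers` text-generation pipeline, which applies the model's ChatML template automatically:

```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="BSC-LT/salamandraTA-7B-academic",  # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative paragraph-level translation instruction.
messages = [{
    "role": "user",
    "content": (
        "Translate the following paragraph from English into Spanish.\n"
        "English: Academic writing requires precision and clarity.\n"
        "Spanish:"
    ),
}]

result = generator(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])
```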
| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| xx → en | GPT-mini | 46.03 | 1.00 | 0.60 | 0.84 | 0.77 |
| | GPT-nano | 41.30 | 0.97 | 0.55 | 0.84 | 0.78 |
| | Gemini-2 | 48.65 | 1.00 | 0.61 | 0.84 | 0.77 |
| | Gemini-2.5 | 45.10 | 0.98 | 0.58 | 0.84 | 0.77 |
| | Llama-3-8B | 43.12 | 0.99 | 0.56 | 0.83 | 0.76 |
| | Gemma-3-27B | 46.37 | 0.98 | 0.59 | 0.84 | 0.77 |
| | MADLAD-7B | 38.69 | 0.86 | 0.51 | 0.81 | 0.77 |
| | Salamandra-2B | 37.09 | 0.92 | 0.52 | 0.82 | 0.75 |
| | + ACADTRAIN | 48.45 | 1.00 | 0.61 | 0.83 | 0.76 |
| | Salamandra-7B | 45.87 | 0.99 | 0.59 | 0.83 | 0.76 |
| | + ACADTRAIN | 50.07 | 1.00 | 0.62 | 0.84 | 0.76 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| en → xx | GPT-mini | 45.01 | 0.99 | - | 0.86 | 0.82 |
| | GPT-nano | 43.78 | 1.00 | - | 0.86 | 0.82 |
| | Gemini-2 | 48.00 | 0.99 | - | 0.87 | 0.82 |
| | Gemini-2.5 | 47.75 | 0.99 | - | 0.87 | 0.82 |
| | Llama-3-8B | 39.87 | 0.99 | - | 0.85 | 0.81 |
| | Gemma-3-27B | 46.29 | 0.99 | - | 0.86 | 0.82 |
| | MADLAD-7B | 36.08 | 0.82 | - | 0.83 | 0.80 |
| | Salamandra-2B | 32.91 | 0.90 | - | 0.83 | 0.78 |
| | + ACADTRAIN | 46.86 | 0.98 | - | 0.86 | 0.81 |
| | Salamandra-7B | 42.55 | 0.98 | - | 0.86 | 0.81 |
| | + ACADTRAIN | 49.20 | 0.98 | - | 0.86 | 0.81 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| xx → es | GPT-mini | 60.60 | 0.98 | - | 0.86 | 0.82 |
| | GPT-nano | 57.88 | 0.99 | - | 0.86 | 0.82 |
| | Gemini-2 | 62.02 | 0.99 | - | 0.86 | 0.82 |
| | Gemini-2.5 | 61.43 | 0.98 | - | 0.87 | 0.82 |
| | Llama-3-8B | 55.4 | 0.98 | - | 0.86 | 0.81 |
| | Gemma-3-27B | 60.71 | 0.98 | - | 0.86 | 0.82 |
| | MADLAD-7B | 43.44 | 0.76 | - | 0.83 | 0.81 |
| | Salamandra-2B | 50.09 | 0.92 | - | 0.85 | 0.80 |
| | + ACADTRAIN | 61.97 | 0.98 | - | 0.86 | 0.82 |
| | Salamandra-7B | 57.55 | 0.98 | - | 0.86 | 0.82 |
| | + ACADTRAIN | 63.60 | 0.98 | - | 0.86 | 0.82 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| es → xx | GPT-mini | 54.19 | 0.99 | - | 0.86 | 0.81 |
| | GPT-nano | 51.95 | 0.99 | - | 0.86 | 0.81 |
| | Gemini-2 | 60.28 | 0.99 | - | 0.86 | 0.81 |
| | Gemini-2.5 | 57.61 | 0.99 | - | 0.86 | 0.81 |
| | Llama-3-8B | 52.12 | 0.99 | - | 0.85 | 0.80 |
| | Gemma-3-27B | 57.31 | 0.99 | - | 0.86 | 0.81 |
| | MADLAD-7B | 40.13 | 0.79 | - | 0.83 | 0.81 |
| | Salamandra-2B | 47.84 | 0.94 | - | 0.84 | 0.80 |
| | + ACADTRAIN | 60.09 | 0.99 | - | 0.86 | 0.81 |
| | Salamandra-7B | 55.65 | 0.98 | - | 0.86 | 0.80 |
| | + ACADTRAIN | 61.61 | 0.99 | - | 0.86 | 0.81 |

Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found in the Salamandra-7B model card. No specific analysis has yet been carried out to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in machine translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework MT-Lens. Note that the model has only undergone preliminary instruction tuning. We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.

Author
The Language Technologies Unit from Barcelona Supercomputing Center.

Contact
For further information, please send an email to .

Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje. This work has been promoted and financed by the Government of Catalonia through the Aina project. This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

llama
25
0

whisper-3cat-cv21-valencian

license:apache-2.0
25
0

spanish-verification-model-pkt-a

- Paper
- Model Summary
- Intended Uses and Limitations
- How to Get Started with the Model
- Training Details
- Citation
- Additional Information

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence. The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.

In this model card, we present Verification Model A for Spanish, available as "spanish-verification-model-pkt-a". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model B, "spanish-verification-model-pkt-b", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. Two additional models, Model C "spanish-verification-model-pkt-c" and Model D "spanish-verification-model-pkt-d", were trained with different subsets of the corpus YODAS, and they can also be used together with models A and B for better results.

This model is designed for the following scenarios:

- Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
- Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.
- Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
- Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

Limitations:

- No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
- Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
- Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
- Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.
To see an updated and functional version of this code, please visit NVIDIA's official repository. To use this model, first install the NVIDIA NeMo Framework. For inference, to transcribe audio in Spanish using this model, you can follow the sketch at the end of this card.

The datasets used to train Model A (1,319h00m in total) are:

| Dataset                       | Size (h)  |
|-------------------------------|-----------|
| Multilingual LibriSpeech (es) | 917h41m   |
| Voxforge Spanish              | 49h42m    |
| Fisher Spanish                | 131h36m   |
| Voxpopuli (es)                | 151h55m   |
| CIEMPIESS BALANCE             | 18h20m    |
| DIMEx100 LIGHT                | 06h09m    |
| Wikipedia Spanish             | 25h37m    |
| MediaSpeech Spanish           | 10h00m    |
| Heroico                       | 16h33m    |
| Total hours for Model A       | 1,319h00m |

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Citation If this model contributes to your research, please cite the work: The fine-tuning process was performed during July 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, supervised by Cristina España-Bonet.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
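Since the original install and inference snippets did not survive extraction, the following is a minimal sketch of NeMo-based transcription. The local `.nemo` file name and the audio path are placeholders, and the exact return type of `transcribe()` varies across NeMo versions.

```python
# Minimal sketch, not the authors' exact snippet. Install NeMo first:
#   pip install -U nemo_toolkit['asr']
import nemo.collections.asr as nemo_asr

# Load the fine-tuned Parakeet-RNNT checkpoint; the local .nemo path is a placeholder.
asr_model = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-a.nemo")

# Transcribe one or more 16 kHz mono WAV files (file name is hypothetical).
hypotheses = asr_model.transcribe(["sample_es.wav"])
print(hypotheses[0])
```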

NaNK
license:apache-2.0
20
2

sciroshot

license:apache-2.0
18
10

spanish-verification-model-pkt-d

- Paper
- Model Summary
- Intended Uses and Limitations
- How to Get Started with the Model
- Training Details
- Citation
- Additional Information

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence. The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.

In this model card, we present Verification Model D for Spanish, available as "spanish-verification-model-pkt-d". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model C, "spanish-verification-model-pkt-c", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. These models can also be used together with Model A "spanish-verification-model-pkt-a" and Model B "spanish-verification-model-pkt-b" for better results.

This model is designed for the following scenarios:

- Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
- Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.
- Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
- Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

Limitations:

- No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
- Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
- Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
- Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.

To see an updated and functional version of this code, please visit NVIDIA's official repository. To use this model, first install the NVIDIA NeMo Framework. For inference, to transcribe audio in Spanish using this model, you can follow the sketch at the end of this card.

The training data for Model D consists of 1,500 hours of Spanish speech extracted from the YODAS dataset.
To ensure high-quality supervision, we applied a double-consensus filtering strategy: we only kept those utterances for which the output of Model A "spanish-verification-model-pkt-a" and the output of Model B "spanish-verification-model-pkt-b" were identical. This approach allowed us to minimize noisy or ambiguous transcriptions while maintaining a large amount of diverse training material.

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Citation If this model contributes to your research, please cite the work: The fine-tuning process was performed during August 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, supervised by Cristina España-Bonet.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
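A minimal sketch of the double-consensus filter just described, assuming both models are available as local `.nemo` checkpoints; file names and the normalization policy are placeholders, not the authors' exact pipeline.

```python
# Minimal sketch: keep an utterance only when Models A and B produce
# identical transcriptions. Checkpoint paths and file names are hypothetical.
import nemo.collections.asr as nemo_asr

model_a = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-a.nemo")
model_b = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-b.nemo")

def as_text(hyp):
    # Depending on the NeMo version, transcribe() returns strings or Hypothesis objects.
    return getattr(hyp, "text", hyp)

def keep_utterance(wav_path: str) -> bool:
    """Return True when both verification models agree on the transcription."""
    hyp_a = as_text(model_a.transcribe([wav_path])[0])
    hyp_b = as_text(model_b.transcribe([wav_path])[0])
    # Exact match after light normalization; stricter or looser matching
    # (e.g., ignoring punctuation) is a design choice, not prescribed here.
    return hyp_a.strip().lower() == hyp_b.strip().lower()

audio_files = ["yodas_clip_0001.wav", "yodas_clip_0002.wav"]  # placeholders
accepted = [f for f in audio_files if keep_utterance(f)]
print(f"kept {len(accepted)} of {len(audio_files)} utterances")
```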

NaNK
license:apache-2.0
18
0

spanish-verification-model-pkt-b

- Paper
- Model Summary
- Intended Uses and Limitations
- How to Get Started with the Model
- Training Details
- Citation
- Additional Information

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence. The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.

In this model card, we present Verification Model B for Spanish, available as "spanish-verification-model-pkt-b". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model A, "spanish-verification-model-pkt-a", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. Two additional models, Model C "spanish-verification-model-pkt-c" and Model D "spanish-verification-model-pkt-d", were trained on different subsets of the YODAS corpus, and they can also be used together with Models A and B for better results.

This model is designed for the following scenarios:

- Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
- Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.
- Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
- Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

Limitations:

- No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
- Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
- Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
- Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.
To see an updated and functional version of this code, please visit NVIDIA's official repository. To use this model, first install the NVIDIA NeMo Framework. For inference, to transcribe audio in Spanish using this model, you can follow the sketch at the end of this card.

The datasets used to train Model B (1,362h17m in total) are:

| Dataset                    | Size (h)  |
|----------------------------|-----------|
| Common Voice 17 (es)       | 485h31m   |
| Common Voice 17 Other (es) | 784h50m   |
| CIEMPIESS LIGHT            | 18h25m    |
| Latino 40                  | 06h48m    |
| TeleconCiencia             | 28h16m    |
| CIEMPIESS FEM              | 13h54m    |
| TEDx Spanish               | 24h29m    |
| Total hours for Model B    | 1,362h17m |

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Citation If this model contributes to your research, please cite the work: The fine-tuning process was performed during July 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, supervised by Cristina España-Bonet.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
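As a complement, here is a minimal sketch of the reference-agreement signal mentioned above: the hypothesis from Model B is compared against an existing reference transcription. The checkpoint path, file names, reference text, and normalization policy are all assumptions.

```python
# Minimal sketch: treat a transcription as reliable when it matches an
# existing reference. Paths and the reference string are placeholders.
import nemo.collections.asr as nemo_asr

model_b = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-b.nemo")

def as_text(hyp):
    # Depending on the NeMo version, transcribe() returns strings or Hypothesis objects.
    return getattr(hyp, "text", hyp)

reference = "buenos días a todos"  # existing (possibly noisy) reference transcription
hypothesis = as_text(model_b.transcribe(["sample_es.wav"])[0])

# Exact match after light normalization; the normalization policy is a design choice.
is_reliable = hypothesis.strip().lower() == reference.strip().lower()
print("reliable" if is_reliable else "needs review")
```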

NaNK
license:apache-2.0
16
0

spanish-verification-model-pkt-c

- Paper
- Model Summary
- Intended Uses and Limitations
- How to Get Started with the Model
- Training Details
- Citation
- Additional Information

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence. The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.

In this model card, we present Verification Model C for Spanish, available as "spanish-verification-model-pkt-c". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model D, "spanish-verification-model-pkt-d", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. These models can also be used together with Model A "spanish-verification-model-pkt-a" and Model B "spanish-verification-model-pkt-b" for better results.

This model is designed for the following scenarios:

- Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
- Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.
- Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
- Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.

Limitations:

- No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
- Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
- Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
- Language- and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.

To see an updated and functional version of this code, please visit NVIDIA's official repository. To use this model, first install the NVIDIA NeMo Framework. For inference, to transcribe audio in Spanish using this model, you can follow the sketch at the end of this card.

The training data for Model C consists of 1,500 hours of Spanish speech extracted from the YODAS dataset.
To ensure high-quality supervision, we applied a triple-consensus filtering strategy: we only kept those utterances for which the reference transcription in YODAS, the output of Model A "spanish-verification-model-pkt-a", and the output of Model B "spanish-verification-model-pkt-b" were all identical. This approach allowed us to minimize noisy or ambiguous transcriptions while maintaining a large amount of diverse training material.

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Citation If this model contributes to your research, please cite the work: The fine-tuning process was performed during August 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, supervised by Cristina España-Bonet.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
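For completeness, a minimal sketch of the triple-consensus filter, which extends the double-consensus idea with the YODAS reference transcription as a third condition; all paths and the normalization are placeholders.

```python
# Minimal sketch: keep a YODAS utterance only when its reference transcription
# and the outputs of Models A and B are all identical. Paths are hypothetical.
import nemo.collections.asr as nemo_asr

model_a = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-a.nemo")
model_b = nemo_asr.models.ASRModel.restore_from("spanish-verification-model-pkt-b.nemo")

def as_text(hyp):
    return getattr(hyp, "text", hyp)  # NeMo may return strings or Hypothesis objects

def triple_consensus(wav_path: str, reference: str) -> bool:
    norm = lambda s: s.strip().lower()
    hyp_a = norm(as_text(model_a.transcribe([wav_path])[0]))
    hyp_b = norm(as_text(model_b.transcribe([wav_path])[0]))
    return hyp_a == hyp_b == norm(reference)

# Example with a hypothetical YODAS clip and its reference transcription.
print(triple_consensus("yodas_clip_0001.wav", "hola qué tal"))
```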

NaNK
license:apache-2.0
16
0

roberta-base-biomedical-es

license:apache-2.0
15
3

salamandra-7b-instruct-fp8

NaNK
llama
14
1

salamandra-2b-base-gptq

NaNK
llama
14
0

Salamandra-VL-7B-2512

NaNK
12
0

faster-whisper-bsc-large-v3-cat

- Model Description
- Intended Uses and Limitations
- How to Get Started with the Model
- Conversion Details
- Citation
- Additional information

The "faster-whisper-bsc-large-v3-cat" is an acoustic model based on a faster-whisper version of whisper-bsc-large-v3-cat, suitable for Automatic Speech Recognition in Catalan. It is the result of converting whisper-bsc-large-v3-cat into a lighter model using a Python module called faster-whisper.

This model can be used for Automatic Speech Recognition (ASR) in Catalan. The model is intended to transcribe Catalan audio files to plain text without punctuation.

To see an updated and functional version of this code, please visit our Notebook. For inference, to transcribe audio in Catalan using this model, you can follow the sketch at the end of this card.

This model is not a direct result of training. It is a conversion of a Whisper model using faster-whisper. The procedure to create the model is as follows:

Citation If this model contributes to your research, please cite the work: The conversion process was performed during May 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center by Abir Messaoudi.

Contact For further information, please send an email to .

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337. The conversion of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.
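The original inference example did not survive extraction; below is a minimal sketch using the faster-whisper API, in which the Hugging Face repo id `BSC-LT/faster-whisper-bsc-large-v3-cat`, the device settings, and the audio file name are assumptions.

```python
# Minimal sketch of inference with faster-whisper; install first:
#   pip install faster-whisper
from faster_whisper import WhisperModel

# The repo id and compute settings are assumptions, not the authors' exact setup.
model = WhisperModel("BSC-LT/faster-whisper-bsc-large-v3-cat", device="cpu", compute_type="int8")

# Force Catalan decoding; segments is a generator of timestamped chunks.
segments, info = model.transcribe("sample_ca.wav", language="ca")
print(" ".join(segment.text.strip() for segment in segments))
```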

NaNK
license:apache-2.0
10
2

hubert-base-ca-2k

NaNK
license:apache-2.0
10
0

RoBERTalex

license:apache-2.0
9
5

sentis-matxa-tts-wavenext-multispeaker-ca

license:gpl-3.0
7
1

salamandraTA-2B-academic

This repository contains the model SalamandraTA-2B-academic, a machine translation fine-tuning of Salamandra2B-Instruct. This model has been obtained following the procedures shown in ACADATA: Parallel Dataset of Academic Data for Machine Translation.

> [!WARNING]
> DISCLAIMER: This version of Salamandra is tailored exclusively for translation tasks. Although this machine translation version was obtained by fine-tuning an instructed model, its chat capabilities have not been tested; for chat use, we refer to the original instructed version.

| | |
|-------------------------|:--------------|
| Total Parameters | 2,253,490,176 |
| Embedding Parameters | 524,288,000 |
| Layers | 24 |
| Hidden size | 2,048 |
| Attention heads | 16 |
| Context length | 8,192 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Embedding type | RoPE |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | ✅ |
| Grouped Query Attention | ❌ |
| Num. query groups | N/A |

The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks. The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

SalamandraTA-2B-academic was instruction-tuned with FastChat. All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center. The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
- 4x NDR200 (BW per node 800Gb/s)
- 512 GB of Main memory (DDR5)
- 460GB on NVMe storage

SalamandraTA-2B-academic was fine-tuned on the ACAD-Train dataset, which focuses on pairs involving English, Iberian Peninsula languages, and several Central European languages, namely: Asturian (ast), Catalan (ca), German (de), Greek (el), Spanish (es), English (en), Basque (eu), French (fr), Galician (gl), Italian (it), Dutch (nl) and Portuguese (pt). The dataset includes 48 unique language pairs. Since each pair is used for translation in both directions (e.g., English to Spanish and Spanish to English), this results in the 96 total supported directions. The most frequent language pairs, accounting for 96.5% of the dataset, are:
- English - Spanish (en-es)
- English - French (en-fr)
- English - Catalan (en-ca)
- Catalan - Spanish (ca-es)
- Spanish - French (es-fr)
- English - Portuguese (en-pt)

A comprehensive list of all language pairs included in the ACAD-Train dataset is available in the paper.

The corpus used for the instruction tuning is ACAData. For more details about the corpus construction, you can refer to the Paper.

The instruction-following model uses the commonly adopted ChatML template. The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the snippet below. Using this template, each turn is preceded by an `<|im_start|>` delimiter and the role of the entity (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token. The prompt template used during training is the recommended one.
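The original snippet and the exact recommended prompt did not survive extraction; the following is a minimal sketch of applying the ChatML template with the Hugging Face tokenizer. The repo id `BSC-LT/salamandraTA-2B-academic` and the wording of the translation instruction are assumptions, not the exact template used during training.

```python
# Minimal sketch of a ChatML-formatted translation request; the instruction
# wording is a placeholder, not the training-time prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-2B-academic"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": "Translate the following text from English into Catalan.\n"
                   "English: The experiment was repeated three times.\nCatalan:",
    }
]
# apply_chat_template wraps each turn in <|im_start|> ... <|im_end|> markers.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```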
Aggregated results for the xx ↔ en and xx ↔ es translation directions on the ACAD-Bench dataset are reported below. Baselines are grouped into large-scale proprietary general models, medium- to small-sized open-weights models and dedicated MMNMT models. For every metric the top-scoring system is shown in bold. For a more detailed evaluation discussion, please refer to the paper.

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| xx → en | GPT-mini | 46.03 | 1.00 | 0.60 | 0.84 | 0.77 |
| | GPT-nano | 41.30 | 0.97 | 0.55 | 0.84 | 0.78 |
| | Gemini-2 | 48.65 | 1.00 | 0.61 | 0.84 | 0.77 |
| | Gemini-2.5 | 45.10 | 0.98 | 0.58 | 0.84 | 0.77 |
| | Llama-3-8B | 43.12 | 0.99 | 0.56 | 0.83 | 0.76 |
| | Gemma-3-27B | 46.37 | 0.98 | 0.59 | 0.84 | 0.77 |
| | MADLAD-7B | 38.69 | 0.86 | 0.51 | 0.81 | 0.77 |
| | Salamandra-2B | 37.09 | 0.92 | 0.52 | 0.82 | 0.75 |
| |   + ACADTRAIN | 48.45 | 1.00 | 0.61 | 0.83 | 0.76 |
| | Salamandra-7B | 45.87 | 0.99 | 0.59 | 0.83 | 0.76 |
| |   + ACADTRAIN | 50.07 | 1.00 | 0.62 | 0.84 | 0.76 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| en → xx | GPT-mini | 45.01 | 0.99 | - | 0.86 | 0.82 |
| | GPT-nano | 43.78 | 1.00 | - | 0.86 | 0.82 |
| | Gemini-2 | 48.00 | 0.99 | - | 0.87 | 0.82 |
| | Gemini-2.5 | 47.75 | 0.99 | - | 0.87 | 0.82 |
| | Llama-3-8B | 39.87 | 0.99 | - | 0.85 | 0.81 |
| | Gemma-3-27B | 46.29 | 0.99 | - | 0.86 | 0.82 |
| | MADLAD-7B | 36.08 | 0.82 | - | 0.83 | 0.80 |
| | Salamandra-2B | 32.91 | 0.90 | - | 0.83 | 0.78 |
| |   + ACADTRAIN | 46.86 | 0.98 | - | 0.86 | 0.81 |
| | Salamandra-7B | 42.55 | 0.98 | - | 0.86 | 0.81 |
| |   + ACADTRAIN | 49.20 | 0.98 | - | 0.86 | 0.81 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| xx → es | GPT-mini | 60.60 | 0.98 | - | 0.86 | 0.82 |
| | GPT-nano | 57.88 | 0.99 | - | 0.86 | 0.82 |
| | Gemini-2 | 62.02 | 0.99 | - | 0.86 | 0.82 |
| | Gemini-2.5 | 61.43 | 0.98 | - | 0.87 | 0.82 |
| | Llama-3-8B | 55.40 | 0.98 | - | 0.86 | 0.81 |
| | Gemma-3-27B | 60.71 | 0.98 | - | 0.86 | 0.82 |
| | MADLAD-7B | 43.44 | 0.76 | - | 0.83 | 0.81 |
| | Salamandra-2B | 50.09 | 0.92 | - | 0.85 | 0.80 |
| |   + ACADTRAIN | 61.97 | 0.98 | - | 0.86 | 0.82 |
| | Salamandra-7B | 57.55 | 0.98 | - | 0.86 | 0.82 |
| |   + ACADTRAIN | 63.60 | 0.98 | - | 0.86 | 0.82 |

| Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| es → xx | GPT-mini | 54.19 | 0.99 | - | 0.86 | 0.81 |
| | GPT-nano | 51.95 | 0.99 | - | 0.86 | 0.81 |
| | Gemini-2 | 60.28 | 0.99 | - | 0.86 | 0.81 |
| | Gemini-2.5 | 57.61 | 0.99 | - | 0.86 | 0.81 |
| | Llama-3-8B | 52.12 | 0.99 | - | 0.85 | 0.80 |
| | Gemma-3-27B | 57.31 | 0.99 | - | 0.86 | 0.81 |
| | MADLAD-7B | 40.13 | 0.79 | - | 0.83 | 0.81 |
| | Salamandra-2B | 47.84 | 0.94 | - | 0.84 | 0.80 |
| |   + ACADTRAIN | 60.09 | 0.99 | - | 0.86 | 0.81 |
| | Salamandra-7B | 55.65 | 0.98 | - | 0.86 | 0.80 |
| |   + ACADTRAIN | 61.61 | 0.99 | - | 0.86 | 0.81 |

Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found at the Salamandra-2B model card. No specific analysis has yet been carried out in order to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains.
However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in Machine Translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework MT-Lens. Note that the model has only undergone preliminary instruction tuning. We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.

Author The Language Technologies Unit from Barcelona Supercomputing Center.

Contact For further information, please send an email to .

Copyright Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje. This work has been promoted and financed by the Government of Catalonia through the Aina project. This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

Disclaimer Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

NaNK
llama
7
0

MrBERT-science

license:apache-2.0
6
0

wavenext-encodec

NaNK
license:apache-2.0
6
0

AL40b-dev

NaNK
llama
6
0

whisper-3cat-balearic

- Model Description
- Intended Uses and Limitations
- How to Get Started with the Model
- Training Details
- Citation
- Additional Information

The "BSC-LT/whisper-3cat-balearic" is an acoustic model suitable for Automatic Speech Recognition in Balearic Catalan. It is the result of fine-tuning the model "openai/whisper-large-v3" on the split called "perfectmatches" of the corpus 3catparlaasr, a dataset of broadcast Catalan TV shows manually transcribed. This particular split comprises 90 hours of speech data.

This model can be used for Automatic Speech Recognition (ASR) in Catalan, especially in the Balearic accent. The model is intended to transcribe Catalan audio files to plain text without punctuation.

To use this model, you may install datasets and transformers. For inference, to transcribe audio in Catalan using this model, you can follow the sketch at the end of this card.

The specific datasets used to create the model are:
- Training: 3CatParla (soon to be published)
- Validation: IB3 (soon to be published)

This model is the result of fine-tuning the model "openai/whisper-large-v3" by following this tutorial provided by Hugging Face.

- language: Catalan (Balearic accent)
- hours of training audio: 90 hours
- learning rate: 1e-6
- sample rate: 16000
- train batch size: 32
- eval batch size: 32
- num_train_epochs: 20

If this model contributes to your research, please cite the work: The fine-tuning process was performed during June 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337. The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
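The original example did not survive extraction; here is a minimal sketch of inference with the transformers pipeline, where the audio file name is a placeholder and the language hint is passed through `generate_kwargs`.

```python
# Minimal sketch of Whisper inference with transformers; install first:
#   pip install transformers datasets
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="BSC-LT/whisper-3cat-balearic",
)

# Whisper accepts a language hint through generate_kwargs; the file is a placeholder.
result = asr("sample_ca.wav", generate_kwargs={"language": "catalan"})
print(result["text"])
```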

NaNK
license:apache-2.0
5
1

salamandra-2b-instruct-gptq

NaNK
llama
5
0

faster-whisper-3cat-cv21-valencian

- Model Description
- Intended Uses and Limitations
- How to Get Started with the Model
- Conversion Details
- Citation
- Additional Information

The "BSC-LT/faster-whisper-3cat-cv21-valencian" is an acoustic model based on a faster-whisper version of BSC-LT/whisper-3cat-cv21-valencian.

Intended Uses and Limitations This model is the result of converting BSC-LT/whisper-3cat-cv21-valencian into a lighter model using a Python module called faster-whisper. The model can be used for Automatic Speech Recognition (ASR) in Catalan, especially in the Valencian accent. The model is intended to transcribe Catalan audio files to plain text without punctuation.

For inference, to transcribe audio in Catalan using this model, you can follow the sketch at the end of this card.

This model is not a direct result of training. It is a conversion of a Whisper model using faster-whisper; the conversion step is also included in the sketch at the end of this card.

If this model contributes to your research, please cite the work: The conversion process was performed during June 2025 in the Language Technologies Laboratory of the Barcelona Supercomputing Center.

Copyright Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

Funding This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337. The conversion of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5. We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 as BSC, Spain.
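A minimal sketch of the conversion-plus-inference flow under stated assumptions: the repo id, output directory, and quantization settings are illustrative, not the authors' exact commands, and the conversion is shown via the CTranslate2 Python converter that faster-whisper builds on.

```python
# Minimal sketch: convert a fine-tuned Whisper checkpoint to the CTranslate2
# format used by faster-whisper, then transcribe. Install first:
#   pip install faster-whisper ctranslate2 transformers
import ctranslate2
from faster_whisper import WhisperModel

# 1) Conversion: output directory and quantization are assumptions.
converter = ctranslate2.converters.TransformersConverter("BSC-LT/whisper-3cat-cv21-valencian")
converter.convert("faster-whisper-3cat-cv21-valencian", quantization="float16")

# 2) Inference on Valencian Catalan audio (file name is a placeholder).
model = WhisperModel("faster-whisper-3cat-cv21-valencian", device="cpu", compute_type="int8")
segments, info = model.transcribe("sample_ca.wav", language="ca")
print(" ".join(segment.text.strip() for segment in segments))
```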

NaNK
license:apache-2.0
5
0

faster-whisper-3cat-balearic

NaNK
license:apache-2.0
5
0

salamandra-7b-base-fp8

NaNK
llama
4
2

mdebertaCA

license:apache-2.0
4
0

NextProcurement-NER-Spanish-UTE-Company

license:apache-2.0
3
0

salamandra-2b-instruct-fp8

NaNK
llama
3
0

wavenext-mel

license:apache-2.0
1
3

salamandra-2b-instruct-aina-hack

NaNK
llama
1
3

experimental7b-rag

NaNK
llama
1
0

salamandra-2b-base-fp8

This model is the fp8-quantized version of Salamandra-2b. The model weights are quantized from FP16 to FP8 (8-bit weights) using the FP8 quantization algorithm from NeuralMagic. Inference with this model can be done using vLLM.

Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants, promoted and financed by the Government of Catalonia through the Aina Project and the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of ILENIA Project with reference 2022/TL22/00215337. This model card corresponds to the fp8-quantized version of Salamandra-2b. The entire Salamandra family is released under a permissive Apache 2.0 license.

The following example code works under ``Python 3.9.16``, ``vllm==0.6.3.post1``, ``torch==2.4.0`` and ``torchvision==0.19.0``, though it should run on any current version of the libraries. A sketch of creating a text completion with the model is shown at the end of this card.

Contact For further information, please send an email to .

Acknowledgements We appreciate the collaboration with IBM in this work. Specifically, the IBM team created the fp8-quantized version of the Salamandra-2b model released here.

Disclaimer Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence. Barcelona Supercomputing Center and International Business Machines shall not be held liable for any outcomes resulting from third-party use.
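The completion example referenced above did not survive extraction; here is a minimal vLLM sketch in which the repo id `BSC-LT/salamandra-2b-base-fp8`, the prompt, and the sampling settings are assumptions.

```python
# Minimal sketch of text completion with vLLM; install first:
#   pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="BSC-LT/salamandra-2b-base-fp8")  # assumed repo id
params = SamplingParams(temperature=0.8, max_tokens=64)

# Base models do plain continuation, so we pass raw text rather than chat turns.
outputs = llm.generate(["El Mediterrani és"], params)
print(outputs[0].outputs[0].text)
```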

NaNK
llama
1
0