utter-project
EuroLLM-1.7B-Instruct
---
license: apache-2.0
language:
  - en
  - de
  - es
  - fr
  - it
  - pt
  - pl
  - nl
  - tr
  - sv
  - cs
  - el
  - hu
  - ro
  - fi
  - uk
  - sl
  - sk
  - da
  - lt
  - lv
  - et
  - bg
  - 'no'
  - ca
  - hr
  - ga
  - mt
  - gl
  - zh
  - ru
  - ko
  - ja
  - ar
  - hi
base_model:
  - utter-project/EuroLLM-1.7B
library_name: transformers
---
mHuBERT-147
EuroLLM-9B
EuroLLM-1.7B
This is the model card for the first pre-trained model of the EuroLLM series: EuroLLM-1.7B. You can also check the instruction-tuned version: EuroLLM-1.7B-Instruct.

- Developed by: Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.
- Funded by: European Union.
- Model type: A 1.7B parameter multilingual transformer LLM.
- Language(s) (NLP): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
- License: Apache License 2.0.

The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-1.7B is a 1.7B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-1.7B-Instruct was further instruction-tuned on EuroBlocks, an instruction-tuning dataset focused on general instruction following and machine translation.

EuroLLM uses a standard, dense Transformer architecture:
- We use grouped query attention (GQA) with 8 key-value heads, since it has been shown to increase speed at inference time while maintaining downstream performance.
- We perform pre-layer normalization, since it improves training stability, and use RMSNorm, which is faster.
- We use the SwiGLU activation function, since it has been shown to lead to good results on downstream tasks.
- We use rotary positional embeddings (RoPE) in every layer, since these have been shown to lead to good performance while allowing the extension of the context length.

For pre-training, we use 256 Nvidia H100 GPUs of the MareNostrum 5 supercomputer, training the model with a constant batch size of 3,072 sequences (approximately 12 million tokens), the Adam optimizer, and BF16 precision. Here is a summary of the model hyper-parameters (see the configuration sketch at the end of this card):

| Hyper-parameter | Value |
|--------------------------|----------------------|
| Sequence Length | 4,096 |
| Number of Layers | 24 |
| Embedding Size | 2,048 |
| FFN Hidden Size | 5,632 |
| Number of Heads | 16 |
| Number of KV Heads (GQA) | 8 |
| Activation Function | SwiGLU |
| Position Encodings | RoPE (\Theta=10,000) |
| Layer Norm | RMSNorm |
| Tied Embeddings | No |
| Embedding Parameters | 0.262B |
| LM Head Parameters | 0.262B |
| Non-embedding Parameters | 1.133B |
| Total Parameters | 1.657B |

To run the model with Hugging Face's Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example prompt; the base model works as a plain text completer.
text = "English: My name is EuroLLM. Portuguese:"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

We evaluate EuroLLM-1.7B-Instruct on several machine translation benchmarks: FLORES-200, WMT-23, and WMT-24, comparing it with Gemma-2B and Gemma-7B (both also instruction-tuned on EuroBlocks). The results show that EuroLLM-1.7B is substantially better than Gemma-2B in machine translation and competitive with Gemma-7B.
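As context for the benchmark tables below, here is a minimal sketch of translation-style prompting for the instruction-tuned model. It is hedged: it assumes EuroLLM-1.7B-Instruct ships a chat template with its tokenizer, and the prompt wording is illustrative, not the official evaluation setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: assumes the instruct tokenizer provides a chat template.
model_id = "utter-project/EuroLLM-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "user",
     "content": "Translate the following English sentence into Portuguese: 'The weather is lovely today.'"},
]
# apply_chat_template renders the messages in the model's own prompt format.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```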
Flores-200

| Model | AVG | AVG en-xx | AVG xx-en | en-ar | en-bg | en-ca | en-cs | en-da | en-de | en-el | en-es-latam | en-et | en-fi | en-fr | en-ga | en-gl | en-hi | en-hr | en-hu | en-it | en-ja | en-ko | en-lt | en-lv | en-mt | en-nl | en-no | en-pl | en-pt-br | en-ro | en-ru | en-sk | en-sl | en-sv | en-tr | en-uk | en-zh-cn | ar-en | bg-en | ca-en | cs-en | da-en | de-en | el-en | es-latam-en | et-en | fi-en | fr-en | ga-en | gl-en | hi-en | hr-en | hu-en | it-en | ja-en | ko-en | lt-en | lv-en | mt-en | nl-en | no-en | pl-en | pt-br-en | ro-en | ru-en | sk-en | sl-en | sv-en | tr-en | uk-en | zh-cn-en |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B-Instruct | 86.89 | 86.53 | 87.25 | 85.17 | 89.42 | 84.72 | 89.13 | 89.47 | 86.90 | 87.60 | 86.29 | 88.95 | 89.40 | 87.69 | 74.89 | 86.41 | 76.92 | 84.79 | 86.78 | 88.17 | 89.76 | 87.70 | 87.27 | 87.62 | 67.84 | 87.10 | 90.00 | 88.18 | 89.29 | 89.49 | 88.32 | 88.18 | 86.85 | 90.00 | 87.31 | 87.89 | 86.60 | 86.34 | 87.45 | 87.57 | 87.95 | 89.72 | 88.80 | 87.00 | 86.77 | 88.34 | 89.09 | 88.95 | 82.69 | 87.80 | 88.37 | 86.71 | 87.20 | 87.81 | 86.79 | 86.79 | 85.62 | 86.48 | 81.10 | 86.97 | 90.25 | 85.75 | 89.20 | 88.88 | 86.00 | 87.38 | 86.76 | 89.61 | 87.94 |
| Gemma-2B-EuroBlocks | 81.59 | 78.97 | 84.21 | 76.68 | 82.73 | 83.14 | 81.63 | 84.63 | 83.15 | 79.42 | 84.05 | 72.58 | 79.73 | 84.97 | 40.50 | 82.13 | 67.79 | 80.53 | 78.36 | 84.90 | 87.43 | 82.98 | 72.29 | 68.68 | 58.55 | 83.13 | 86.15 | 82.78 | 86.79 | 83.14 | 84.61 | 78.18 | 75.37 | 80.89 | 78.38 | 84.38 | 84.35 | 83.88 | 85.77 | 86.85 | 86.31 | 88.24 | 88.12 | 84.79 | 84.90 | 82.51 | 86.32 | 88.29 | 54.78 | 86.53 | 85.83 | 85.41 | 85.18 | 86.77 | 85.78 | 84.99 | 81.65 | 81.78 | 67.27 | 85.92 | 89.07 | 84.14 | 88.07 | 87.17 | 85.23 | 85.09 | 83.95 | 87.57 | 84.77 |
| Gemma-7B-EuroBlocks | 85.27 | 83.90 | 86.64 | 86.38 | 87.87 | 85.74 | 84.25 | 85.69 | 81.49 | 85.52 | 86.93 | 62.83 | 84.96 | 75.34 | 84.93 | 83.91 | 86.92 | 88.19 | 86.11 | 81.73 | 80.55 | 66.85 | 85.31 | 89.36 | 85.87 | 88.62 | 88.06 | 86.67 | 84.79 | 82.71 | 86.45 | 85.19 | 86.67 | 85.77 | 86.36 | 87.21 | 88.09 | 87.17 | 89.40 | 88.26 | 86.74 | 86.73 | 87.25 | 88.87 | 88.81 | 72.45 | 87.62 | 87.86 | 87.08 | 87.01 | 87.58 | 86.92 | 86.70 | 85.10 | 85.74 | 77.81 | 86.83 | 90.40 | 85.41 | 89.04 | 88.77 | 86.13 | 86.67 | 86.32 | 89.27 | 87.92 |

WMT-23

| Model | AVG | AVG en-xx | AVG xx-en | AVG xx-xx | en-de | en-cs | en-uk | en-ru | en-zh-cn | en-ja | de-en | uk-en | ru-en | zh-cn-en | ja-en | cs-uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B-Instruct | 82.91 | 83.20 | 81.77 | 86.82 | 81.56 | 85.23 | 81.30 | 82.47 | 83.61 | 85.03 | 84.06 | 85.25 | 81.31 | 78.83 | 79.42 | 86.82 |
| Gemma-2B-EuroBlocks | 79.96 | 79.01 | 80.86 | 81.15 | 76.82 | 76.05 | 77.92 | 78.98 | 81.58 | 82.73 | 82.71 | 83.99 | 80.35 | 78.27 | 78.99 | 81.15 |
| Gemma-7B-EuroBlocks | 82.76 | 82.26 | 82.70 | 85.98 | 81.37 | 82.42 | 81.54 | 82.18 | 82.90 | 83.17 | 84.29 | 85.70 | 82.46 | 79.73 | 81.33 | 85.98 |

WMT-24

| Model | AVG | AVG en-xx | AVG xx-xx | en-de | en-es-latam | en-cs | en-ru | en-uk | en-ja | en-zh-cn | en-hi | cs-uk | ja-zh-cn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B-Instruct | 79.32 | 79.32 | 79.34 | 79.42 | 80.67 | 80.55 | 78.65 | 80.12 | 82.96 | 80.60 | 71.59 | 83.48 | 75.20 |
| Gemma-2B-EuroBlocks | 74.72 | 74.41 | 75.97 | 74.93 | 78.81 | 70.54 | 74.90 | 75.84 | 79.48 | 78.06 | 62.70 | 79.87 | 72.07 |
| Gemma-7B-EuroBlocks | 78.67 | 78.34 | 80.00 | 78.88 | 80.47 | 78.55 | 78.55 | 80.12 | 80.55 | 78.90 | 70.71 | 84.33 | 75.66 |

General Benchmarks

We also compare EuroLLM-1.7B with TinyLlama-v1.1 and Gemma-2B on two general benchmarks: Arc Challenge and Hellaswag. For the non-English languages, we use the Okapi datasets. The results show that EuroLLM-1.7B is superior to TinyLlama-v1.1 and similar to Gemma-2B on Hellaswag but worse on Arc Challenge. This may be due to the lower number of non-embedding parameters of EuroLLM-1.7B (1.133B against 1.981B).

Arc Challenge

| Model | Average | English | German | Spanish | French | Italian | Portuguese | Chinese | Russian | Dutch | Arabic | Swedish | Hindi | Hungarian | Romanian | Ukrainian | Danish | Catalan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 0.3496 | 0.4061 | 0.3464 | 0.3684 | 0.3627 | 0.3738 | 0.3855 | 0.3521 | 0.3208 | 0.3507 | 0.3045 | 0.3605 | 0.2928 | 0.3271 | 0.3488 | 0.3516 | 0.3513 | 0.3396 |
| TinyLlama-v1.1 | 0.2650 | 0.3712 | 0.2524 | 0.2795 | 0.2883 | 0.2652 | 0.2906 | 0.2410 | 0.2669 | 0.2404 | 0.2310 | 0.2687 | 0.2354 | 0.2449 | 0.2476 | 0.2524 | 0.2494 | 0.2796 |
| Gemma-2B | 0.3617 | 0.4846 | 0.3755 | 0.3940 | 0.4080 | 0.3687 | 0.3872 | 0.3726 | 0.3456 | 0.3328 | 0.3122 | 0.3519 | 0.2851 | 0.3039 | 0.3590 | 0.3601 | 0.3565 | 0.3516 |

Hellaswag

| Model | Average | English | German | Spanish | French | Italian | Portuguese | Russian | Dutch | Arabic | Swedish | Hindi | Hungarian | Romanian | Ukrainian | Danish | Catalan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 0.4744 | 0.4760 | 0.6057 | 0.4793 | 0.5337 | 0.5298 | 0.5085 | 0.5224 | 0.4654 | 0.4949 | 0.4104 | 0.4800 | 0.3655 | 0.4097 | 0.4606 | 0.436 | 0.4702 | 0.4445 |
| TinyLlama-v1.1 | 0.3674 | 0.6248 | 0.3650 | 0.4137 | 0.4010 | 0.3780 | 0.3892 | 0.3494 | 0.3588 | 0.2880 | 0.3561 | 0.2841 | 0.3073 | 0.3267 | 0.3349 | 0.3408 | 0.3613 |
| Gemma-2B | 0.4666 | 0.7165 | 0.4756 | 0.5414 | 0.5180 | 0.4841 | 0.5081 | 0.4664 | 0.4655 | 0.3868 | 0.4383 | 0.3413 | 0.3710 | 0.4316 | 0.4291 | 0.4471 | 0.4448 |

EuroLLM-1.7B has not been aligned to human preferences, so the model may generate problematic outputs (e.g., hallucinations, harmful content, or false statements).

Paper: EuroLLM: Multilingual Language Models for Europe
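To make the hyper-parameter table above concrete, here is the configuration sketch referenced earlier: a hedged, unofficial reconstruction assuming a Llama-style implementation in Transformers. The class choice and the vocabulary size (0.262B embedding parameters / 2,048 embedding size ≈ 128,000) are assumptions, not the published config.

```python
from transformers import LlamaConfig

# Hypothetical config mirroring the hyper-parameter table above.
# Assumptions: a Llama-style architecture class and a vocabulary of
# 128,000 tokens (0.262B embedding parameters / 2,048 embedding size).
config = LlamaConfig(
    vocab_size=128_000,             # assumed, derived from the embedding parameter count
    hidden_size=2_048,              # Embedding Size
    intermediate_size=5_632,        # FFN Hidden Size
    num_hidden_layers=24,           # Number of Layers
    num_attention_heads=16,         # Number of Heads
    num_key_value_heads=8,          # Number of KV Heads (GQA)
    max_position_embeddings=4_096,  # Sequence Length
    rope_theta=10_000.0,            # RoPE theta
    hidden_act="silu",              # SwiGLU uses the SiLU gate
    tie_word_embeddings=False,      # Tied Embeddings: No
)
print(config)
```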
EuroLLM-22B-Instruct-2512
EuroLLM-9B-Instruct
EuroLLM-22B-2512
EuroLLM-22B-Instruct-Preview
EuroMoE-2.6B-A0.6B-Instruct-Preview
EuroVLM-1.7B-Preview
⚠️ PREVIEW RELEASE: This is a preview version of EuroVLM-1.7B. The model is still under development and may have limitations in performance and stability. Use with caution in production environments.

This is the model card for EuroVLM-1.7B-Preview, a multimodal vision-language model based on a long-context version of EuroLLM-1.7B.

- Developed by: Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.
- Funded by: European Union.
- Model type: A 1.7B+400M parameter multilingual multimodal transformer VLM (vision-language model).
- Language(s) (NLP): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
- Modalities: Text and vision (images).
- License: Apache License 2.0.

EuroVLM-1.7B is a 1.7B+400M parameter vision-language model that combines the multilingual capabilities of EuroLLM-1.7B with vision encoding components. EuroVLM-1.7B was (visually) instruction-tuned on a combination of multilingual vision-language datasets, including image captioning, visual question answering, and multimodal reasoning tasks across the supported languages.

EuroVLM uses a multimodal architecture combining a vision encoder with the EuroLLM language model.

Language model component:
- Based on the standard, dense Transformer architecture from EuroLLM-1.7B
- Grouped query attention (GQA) with 8 key-value heads for efficient inference
- Pre-layer normalization with RMSNorm for training stability
- SwiGLU activation function for strong downstream performance
- Rotary positional embeddings (RoPE) in every layer
- Extended context size supporting up to 32K tokens

Vision component:
- Vision Transformer (ViT) encoder, based on google/siglip2-so400m-patch14-384
- Multimodal projector mapping vision representations to token embeddings
- Support for high-resolution image inputs

To use the model with Hugging Face's Transformers library, see the usage sketch at the end of this card.

EuroVLM-1.7B-Instruct supports a wide range of vision-language tasks across multiple languages:
- Multilingual image captioning: generate detailed descriptions of images in any of the supported languages
- Visual question answering: answer questions about image content in multilingual contexts
- Visual instruction following: execute complex instructions that involve both visual analysis and text generation
- Multimodal translation: translate image captions and descriptions between supported languages
- Document understanding: process and analyze documents, charts, and diagrams with multilingual text

EuroVLM-1.7B has not been fully aligned to human preferences, so the model may generate problematic outputs in both text and image understanding contexts (e.g., hallucinations about image content, harmful content, biased interpretations, or false statements about visual information).
Additional considerations for multimodal models include:
- Potential biases in visual interpretation across different cultural contexts
- Limitations in understanding complex visual scenes or unusual image compositions
- Possible inconsistencies between visual understanding and textual generation across languages
- Privacy considerations when processing images that may contain personal information

Users should exercise caution and implement appropriate safety measures when deploying this model in production environments.
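The usage sketch referenced above. It is hedged, not the official snippet: the `AutoProcessor`/`AutoModelForImageTextToText` classes, the chat-message format, and the example image URL are assumptions based on how similar VLMs are served in Transformers; consult the repository for the confirmed API.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical sketch: class names, message format, and URL are assumptions,
# not confirmed by the EuroVLM repository.
model_id = "utter-project/EuroVLM-1.7B-Preview"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Descreve esta imagem em português."},
    ]},
]
# The processor's chat template interleaves the image placeholder and text.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0], skip_special_tokens=True))
```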
EuroVLM-9B-Preview
⚠️ PREVIEW RELEASE: This is a preview version of EuroVLM-9B. The model is still under development and may have limitations in performance and stability. Use with caution in production environments.

This is the model card for EuroVLM-9B-Preview, a multimodal vision-language model based on a long-context version of EuroLLM-9B.

- Developed by: Unbabel, Instituto Superior Técnico, Instituto de Telecomunicações, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université.
- Funded by: European Union.
- Model type: A 9B+400M parameter multilingual multimodal transformer VLM (vision-language model).
- Language(s) (NLP): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
- Modalities: Text and vision (images).
- License: Apache License 2.0.

EuroVLM-9B is a 9B+400M parameter vision-language model that combines the multilingual capabilities of EuroLLM-9B with vision encoding components. EuroVLM-9B was (visually) instruction-tuned on a combination of multilingual vision-language datasets, including image captioning, visual question answering, and multimodal reasoning tasks across the supported languages.

EuroVLM uses a multimodal architecture combining a vision encoder with the EuroLLM language model.

Language model component:
- Based on the standard, dense Transformer architecture from EuroLLM-9B
- Grouped query attention (GQA) with 8 key-value heads for efficient inference
- Pre-layer normalization with RMSNorm for training stability
- SwiGLU activation function for strong downstream performance
- Rotary positional embeddings (RoPE) in every layer
- Extended context size supporting up to 32K tokens

Vision component:
- Vision Transformer (ViT) encoder, based on google/siglip2-so400m-patch14-384
- Multimodal projector mapping vision representations to token embeddings
- Support for high-resolution image inputs

To use the model with Hugging Face's Transformers library, see the usage sketch at the end of this card.

EuroVLM-9B-Instruct supports a wide range of vision-language tasks across multiple languages:
- Multilingual image captioning: generate detailed descriptions of images in any of the supported languages
- Visual question answering: answer questions about image content in multilingual contexts
- Visual instruction following: execute complex instructions that involve both visual analysis and text generation
- Multimodal translation: translate image captions and descriptions between supported languages
- Document understanding: process and analyze documents, charts, and diagrams with multilingual text

EuroVLM-9B has not been fully aligned to human preferences, so the model may generate problematic outputs in both text and image understanding contexts (e.g., hallucinations about image content, harmful content, biased interpretations, or false statements about visual information).
Additional considerations for multimodal models include:
- Potential biases in visual interpretation across different cultural contexts
- Limitations in understanding complex visual scenes or unusual image compositions
- Possible inconsistencies between visual understanding and textual generation across languages
- Privacy considerations when processing images that may contain personal information

Users should exercise caution and implement appropriate safety measures when deploying this model in production environments.
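The usage sketch referenced above, here framed as visual question answering with the 9B preview. The same caveats apply as for the EuroVLM-1.7B sketch: the processor/model classes, message format, and image URL are assumptions, not a confirmed API.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical sketch (same assumptions as the EuroVLM-1.7B example above).
model_id = "utter-project/EuroVLM-9B-Preview"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show? Answer in German."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```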
mHuBERT-147-base-2nd-iter
EuroMoE-2.6B-A0.6B-Instruct-2512
TowerVision-9B
TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks, demonstrating exceptional performance across 20 languages and dialects.

This model card covers the TowerVision family, including the 2B and 9B parameter versions, both in their instruct-tuned (it) and pretrained (pt) variants, with the latter not undergoing instruction tuning.

- Model Family: TowerVision (2B, 9B variants)
- Context length: 8192 tokens
- Languages: 20+ languages including European, Asian, and other language families

🌟 Try TowerVision: Project Page | Code Repository

| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVision-2B | 2B | 🤗 utter-project/TowerVision-2B |
| TowerVision-9B | 9B | 🤗 utter-project/TowerVision-9B |

When using the model, make sure your prompt is formatted correctly. For processing multiple images and prompts simultaneously, see the batched-inference sketch at the end of this card.

Model Architecture: TowerVision uses a multilingual language model based on Tower-Plus (2B and 9B parameters), paired with a SigLIP2-patch14-384 vision encoder through a multimodal adapter for vision-language understanding.

Recommended Precision: We recommend using `bfloat16` rather than float32 or float16 for optimal performance and memory efficiency when running TowerVision models.

Languages Covered: The model has been trained on 20 languages and dialects:
- European languages: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- Asian languages: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- Other languages: Russian, Ukrainian

Key Strengths:
- 🏆 Exceptional performance on culturally-aware benchmarks with deep understanding of cultural contexts and visual nuances
- 🌐 State-of-the-art results on multimodal multilingual translation benchmarks, enabling seamless cross-lingual visual communication
- 📊 Strong cross-lingual transfer capabilities across diverse vision-language tasks

TowerVision models are trained on VisionBlocks, a comprehensive multilingual vision-language dataset comprising 6.31M samples across diverse categories:

| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | 🤗 utter-project/VisionBlocks | Coming Soon |

Dataset Statistics
- Total samples: 6.31M
- Created by our team: 1.21M samples (~19%)
- Human-collected/external: 5.10M samples (~81%)

VisionBlocks contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data:
- Chart/Plot Reasoning: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- General VQA: VQAv2, RLAIF-4V (~488K samples)
- Document VQA: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- Reasoning/Knowledge: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- Multilingual/Cultural: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- Specialized VQA: IconQA, InfographicVQA, Stratos (~34K samples)
- Counting/Math: TallyQA, PixMo-Count (~107K samples)
- Vision/Text: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- Video/Text: LLaVA-Video collections (~1.4M samples)

Collection Types: Human-annotated, synthetically generated, and professionally translated data ensuring high quality and cultural diversity across 20+ languages.

TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks, and excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities.

✅ Fully Supported: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian

📊 Benchmark Coverage: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks.

If you find TowerVision useful in your research, please consider citing the accompanying paper. For errors or additional questions about details in this model card, contact the research team.

TowerVision builds upon the excellent work of:
- LLaVA-NeXT for the foundational vision-language architecture
- Tower-Plus language models for multilingual capabilities
- SigLIP2 for robust vision encoding
- The broader multilingual NLP and multimodal communities
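The prompt-formatting and batched-inference sketches referenced above. These are hedged examples: the processor/model classes and the chat-message format are assumptions based on LLaVA-style models in Transformers (TowerVision builds on LLaVA-NeXT), not a confirmed API; the image URLs are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical sketch: class names and message format are assumptions
# based on LLaVA-style models; check the TowerVision repo for the real API.
model_id = "utter-project/TowerVision-9B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)  # bfloat16 as recommended

def build_prompt(question: str) -> str:
    # The chat template takes care of correct prompt formatting.
    messages = [{"role": "user", "content": [{"type": "image"},
                                             {"type": "text", "text": question}]}]
    return processor.apply_chat_template(messages, add_generation_prompt=True)

images = [
    Image.open(requests.get("https://example.com/street.jpg", stream=True).raw),
    Image.open(requests.get("https://example.com/menu.jpg", stream=True).raw),
]
prompts = [build_prompt("Describe this image in Portuguese."),
           build_prompt("Translate the text in this image into Korean.")]

# Batched inference: multiple images and prompts processed simultaneously.
inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
for seq in outputs:
    print(processor.decode(seq, skip_special_tokens=True))
```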
EuroLLM 22B Preview
EuroMoE-2.6B-A0.6B-Preview
TowerVision-2B
TowerVideo-2B
TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks, demonstrating exceptional performance across 20 languages and dialects.

This model card covers the TowerVision family, including the 2B and 9B parameter versions, both in their instruct-tuned (it) and pretrained (pt) variants, with the latter not undergoing instruction tuning.

- Model Family: TowerVision (2B, 9B variants)
- Context length: 8192 tokens
- Languages: 20+ languages including European, Asian, and other language families

🌟 Try TowerVision: Project Page | Code Repository

| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVideo-2B | 2B | 🤗 utter-project/TowerVision-2B |
| TowerVideo-9B | 9B | 🤗 utter-project/TowerVision-9B |

Model Architecture: TowerVideo uses a multilingual image-language model based on Tower-Plus (2B and 9B parameters), paired with a SigLIP2-patch14-384 vision encoder through a multimodal adapter for vision-language understanding.

Recommended Precision: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVideo models.

Languages Covered: The model has been trained on 20 languages and dialects:
- European languages: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- Asian languages: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- Other languages: Russian, Ukrainian

Key Strengths:
- 🏆 Exceptional performance on culturally-aware benchmarks with deep understanding of cultural contexts and visual nuances
- 📊 Strong cross-lingual transfer capabilities across diverse vision-language tasks

TowerVideo models are trained on a video/text subset of VisionBlocks, a comprehensive multilingual vision-language dataset comprising 6.31M samples across diverse categories:

| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | 🤗 utter-project/VisionBlocks | Coming Soon |

Dataset Statistics
- Total samples: 6.31M
- Created by our team: 1.21M samples (~19%)
- Human-collected/external: 5.10M samples (~81%)

VisionBlocks contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data:
- Chart/Plot Reasoning: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- General VQA: VQAv2, RLAIF-4V (~488K samples)
- Document VQA: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- Reasoning/Knowledge: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- Multilingual/Cultural: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- Specialized VQA: IconQA, InfographicVQA, Stratos (~34K samples)
- Counting/Math: TallyQA, PixMo-Count (~107K samples)
- Vision/Text: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- Video/Text: LLaVA-Video collections (~1.4M samples)

Collection Types: Human-annotated, synthetically generated, and professionally translated data ensuring high quality and cultural diversity across 20+ languages.

TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks, and excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities.

✅ Fully Supported: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian

📊 Benchmark Coverage: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks.

If you find TowerVideo useful in your research, please consider citing the accompanying paper. For errors or additional questions about details in this model card, contact the research team.

TowerVision builds upon the excellent work of:
- LLaVA-NeXT for the foundational vision-language architecture
- Tower-Plus language models for multilingual capabilities
- SigLIP2 for robust vision encoding
- The broader multilingual NLP and multimodal communities
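Since this card does not document TowerVideo's inference API, here is a heavily hedged sketch of one common way to run LLaVA-Video-style models: sample a handful of frames and pass them as an image sequence. The class names, message format, and frame count below are all assumptions.

```python
import cv2  # opencv-python, used here only to sample frames
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical sketch: TowerVideo's exact inference API is not documented in
# this card; everything below (classes, message format, frame count) is assumed.
model_id = "utter-project/TowerVideo-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

def sample_frames(path: str, num_frames: int = 8) -> list[Image.Image]:
    # Uniformly sample num_frames RGB frames from the video file.
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example_video.mp4")
messages = [{"role": "user",
             "content": [{"type": "image"} for _ in frames] +
                        [{"type": "text", "text": "Summarize this video in Spanish."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=frames, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)
print(processor.decode(outputs[0], skip_special_tokens=True))
```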
TowerVideo 9B
TowerVision is a family of open-source multilingual vision-language models with strong capabilities optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally-aware tasks, demonstrating exceptional performance across 20 languages and dialects.

This model card covers the TowerVision family, including the 2B and 9B parameter versions, both in their instruct-tuned (it) and pretrained (pt) variants, with the latter not undergoing instruction tuning.

- Model Family: TowerVision (2B, 9B variants)
- Context length: 8192 tokens
- Languages: 20+ languages including European, Asian, and other language families

🌟 Try TowerVision: Project Page | Code Repository

| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVideo-2B | 2B | 🤗 utter-project/TowerVision-2B |
| TowerVideo-9B | 9B | 🤗 utter-project/TowerVision-9B |

Model Architecture: TowerVideo uses a multilingual image-language model based on Tower-Plus (2B and 9B parameters), paired with a SigLIP2-patch14-384 vision encoder through a multimodal adapter for vision-language understanding.

Recommended Precision: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVideo models.

Languages Covered: The model has been trained on 20 languages and dialects:
- European languages: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- Asian languages: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- Other languages: Russian, Ukrainian

Key Strengths:
- 🏆 Exceptional performance on culturally-aware benchmarks with deep understanding of cultural contexts and visual nuances
- 📊 Strong cross-lingual transfer capabilities across diverse vision-language tasks

TowerVideo models are trained on a video/text subset of VisionBlocks, a comprehensive multilingual vision-language dataset comprising 6.31M samples across diverse categories:

| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | 🤗 utter-project/VisionBlocks | Coming Soon |

Dataset Statistics
- Total samples: 6.31M
- Created by our team: 1.21M samples (~19%)
- Human-collected/external: 5.10M samples (~81%)

VisionBlocks contains samples across multiple categories with both English-only (63.1%) and multilingual (36.9%) data:
- Chart/Plot Reasoning: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- General VQA: VQAv2, RLAIF-4V (~488K samples)
- Document VQA: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- Reasoning/Knowledge: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- Multilingual/Cultural: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- Specialized VQA: IconQA, InfographicVQA, Stratos (~34K samples)
- Counting/Math: TallyQA, PixMo-Count (~107K samples)
- Vision/Text: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- Video/Text: LLaVA-Video collections (~1.4M samples)

Collection Types: Human-annotated, synthetically generated, and professionally translated data ensuring high quality and cultural diversity across 20+ languages.

TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks, and excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities.

✅ Fully Supported: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian

📊 Benchmark Coverage: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance in culturally-aware benchmarks.

If you find TowerVideo useful in your research, please consider citing the accompanying paper. For errors or additional questions about details in this model card, contact the research team.

TowerVision builds upon the excellent work of:
- LLaVA-NeXT for the foundational vision-language architecture
- Tower-Plus language models for multilingual capabilities
- SigLIP2 for robust vision encoding
- The broader multilingual NLP and multimodal communities