Google's AI research division, creators of Gemini and PaLM
electra-base-discriminator
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset. For a detailed description and experimental results, please refer to our paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
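The "real" versus "fake" token objective described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names are hypothetical, a uniform random draw from a toy vocabulary stands in for the generator network, and the 15% replacement rate is an assumption.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Replace a random subset of tokens with samples from a stand-in generator.

    Assumption: a real generator is a neural network; here a uniform draw
    from `vocab` plays that role purely for illustration.
    """
    rng = random.Random(seed)
    corrupted = []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted


def discriminator_labels(original, corrupted):
    """Per-position targets for the discriminator: 1 = fake, 0 = real.

    A position where the generator happened to sample the original token
    compares equal and is therefore labelled real.
    """
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
print(discriminator_labels(original, corrupted))
```

The discriminator receives a target at every position, not only at the replaced ones, which is how the "distinguish real from fake tokens" task differs from predicting the identity of a few masked tokens.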
vit-base-patch16-224
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
This model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. For code examples, we refer to the documentation. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of image preprocessing during training/validation are given in the original repository. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size also generally improves performance.
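The patch and normalization arithmetic above can be checked with a short sketch. The function names are hypothetical, and the rescale factor of 1/255 applied before normalization is an assumption; the resolution, patch size, mean, and standard deviation are the values stated in the card.

```python
def num_patches(image_size=224, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def sequence_length(image_size=224, patch_size=16):
    """Patches plus the single [CLS] token placed at the start of the sequence."""
    return num_patches(image_size, patch_size) + 1


def normalize_pixel(value, mean=0.5, std=0.5):
    """Rescale a 0..255 channel value to 0..1 (assumed), then normalize."""
    return (value / 255.0 - mean) / std


print(num_patches())        # 196 patches for 224x224 with 16x16 patches
print(sequence_length())    # 197 once the [CLS] token is added
print(normalize_pixel(0), normalize_pixel(255))
```

With mean 0.5 and standard deviation 0.5, the normalized channel values fall in the range -1 to 1. At the 384x384 fine-tuning resolution the same patch size gives a longer sequence, which is one reason higher resolution costs more compute.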
gemma-3-1b-it
siglip-so400m-patch14-384
--- license: apache-2.0 tags: - vision widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ---
vit-base-patch16-224-in21k
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.
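The two training details mentioned above, learning rate warmup over 10k steps and gradient clipping at global norm 1, can be sketched as follows. This is an illustration under assumptions: the warmup is taken to be linear, the schedule after warmup is held constant for simplicity, and gradients are represented as plain lists of floats rather than framework tensors.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values, treated as one long vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale every gradient by the same factor so the global norm is at most max_norm."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return [list(tensor) for tensor in grads]
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads]


def warmup_lr(step, base_lr, warmup_steps=10_000):
    """Linear warmup (assumed shape); constant afterwards for simplicity."""
    return base_lr * min(1.0, step / warmup_steps)


grads = [[3.0], [4.0]]                 # global norm is 5
clipped = clip_by_global_norm(grads)   # rescaled so the global norm is 1
print(global_norm(clipped))
print(warmup_lr(5_000, base_lr=1.0))
```

Because all tensors share one scale factor, clipping by global norm preserves the direction of the overall update, unlike clipping each value independently.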
gemma-3-12b-it
--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-12b-pt ---
flan-t5-base
--- language: - en - fr - ro - de - multilingual
siglip-base-patch16-224
--- license: apache-2.0 tags: - vision widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ---
gemma-3-4b-it
--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-4b-pt ---
paligemma2-3b-pt-224
--- library_name: transformers license: gemma pipeline_tag: image-text-to-text extra_gated_heading: Access PaliGemma on Hugging Face extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---
gemma-3-27b-it
--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-27b-pt ---
fnet-base
--- language: en tags: - fnet license: apache-2.0 datasets: - c4 ---
siglip2-base-patch16-naflex
--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---
owlv2-base-patch16-ensemble
--- license: apache-2.0 tags: - vision - zero-shot-object-detection inference: false ---
t5-v1_1-xxl
--- language: en datasets: - c4
flan-t5-large
--- language: - en - fr - ro - de - multilingual
flan-t5-small
--- language: - en - fr - ro - de - multilingual
embeddinggemma-300m
--- license: gemma pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - sentence-transformers - sentence-similarity - feature-extraction - text-embeddings-inference extra_gated_heading: Access EmbeddingGemma on Hugging Face extra_gated_prompt: To access EmbeddingGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_bu
medgemma-4b-it
--- license: other license_name: health-ai-developer-foundations license_link: https://developers.google.com/health-ai-developer-foundations/terms library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access MedGemma on Hugging Face extra_gated_prompt: >- To access MedGemma on Hugging Face, you're required to review and agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms). To do this, please ensur
siglip2-so400m-patch14-384
--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---
mt5-small
--- language: - multilingual - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fil - fr - fy - ga - gd - gl - gu - ha - haw - hi - hmn - ht - hu - hy - ig - is - it - iw - ja - jv - ka - kk - km - kn - ko - ku - ky - la - lb - lo - lt - lv - mg - mi - mk - ml - mn - mr - ms - mt - my - ne - nl - no - ny - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - sm - sn - so - sq - sr - st - su - sv - sw - ta - te - tg - th - tr - uk - und -
t5gemma-s-s-prefixlm
--- license: gemma library_name: transformers pipeline_tag: text2text-generation extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/t5gemma-s-s-prefixlm ---
siglip2-so400m-patch16-naflex
--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---
madlad400-3b-mt
--- license: apache-2.0 language: - multilingual - en - ru - es - fr - de - it - pt - pl - nl - vi - tr - sv - id - ro - cs - zh - hu - ja - th - fi - fa - uk - da - el - "no" - bg - sk - ko - ar - lt - ca - sl - he - et - lv - hi - sq - ms - az - sr - ta - hr - kk - is - ml - mr - te - af - gl - fil - be - mk - eu - bn - ka - mn - bs - uz - ur - sw - yue - ne - kn - kaa - gu - si - cy - eo - la - hy - ky - tg - ga - mt - my - km - tt - so - ku - ps - pa - rw - lo - ha - dv - fy - lb - ckb - mg
mobilebert-uncased
--- language: en thumbnail: https://huggingface.co/front/thumbnails/google.png
gemma-2-2b-it
--- license: gemma library_name: transformers pipeline_tag: text-generation extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: >- To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license tags: - conversational base_model: google/gemma-2-2b ---
gemma-2b
--- library_name: transformers new_version: google/gemma-2-2b license: gemma extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---
vit-hybrid-base-bit-384
--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k ---
paligemma-3b-mix-224
--- library_name: transformers license: gemma pipeline_tag: image-text-to-text extra_gated_heading: Access PaliGemma on Hugging Face extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---
flan-t5-xl
--- language: - en - fr - ro - de - multilingual
owlv2-large-patch14-ensemble
The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. 
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
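The open-vocabulary classification step described in this card, where class-name embeddings from the text model take the place of fixed classification weights, can be sketched as a similarity computation. This is a simplified illustration: the function names are hypothetical, the embeddings are toy two-dimensional vectors, and the use of plain cosine similarity (without the learned projections or scaling a real model may apply) is an assumption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def open_vocab_logits(token_embeddings, query_embeddings):
    """Score every image output token against every text query embedding."""
    tokens = [l2_normalize(t) for t in token_embeddings]
    queries = [l2_normalize(q) for q in query_embeddings]
    return [[dot(t, q) for q in queries] for t in tokens]


def best_query_per_token(logits):
    """Index of the highest-scoring text query for each image token."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


token_embeddings = [[1.0, 0.0], [0.0, 2.0]]   # one vector per image output token
query_embeddings = [[0.0, 1.0], [3.0, 0.0]]   # one vector per text query
logits = open_vocab_logits(token_embeddings, query_embeddings)
print(best_query_per_token(logits))
```

Since the "classifier weights" are just text embeddings, changing the set of detectable classes only requires encoding new text queries, with no retraining of the detection heads.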
siglip2-base-patch16-224
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You...
t5-v1_1-base
ddpm-cifar10-32
gemma-3-270m-it
--- base_model: google/gemma-3-270m license: gemma tags: - gemma3 - gemma - google pipeline_tag: text-generation library_name: transformers extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: >- To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---
owlvit-base-patch32
The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
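The contrastive objective mentioned in this card, which maximizes the similarity of matching (image, text) pairs, can be sketched as below. The card does not give the exact formulation, so the symmetric softmax cross-entropy used here (the form commonly associated with CLIP) is an assumption, and the similarity matrix is toy data.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(similarity):
    """similarity[i][j] scores image i against text j; matches lie on the diagonal."""
    n = len(similarity)
    # Each image should pick out its own text among all texts in the batch.
    image_to_text = -sum(log_softmax(similarity[i])[i] for i in range(n)) / n
    # Each text should pick out its own image among all images in the batch.
    columns = [[similarity[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = [[5.0, 0.0], [0.0, 5.0]]      # matching pairs score highest
misaligned = [[0.0, 5.0], [5.0, 0.0]]   # mismatched pairs score highest
print(contrastive_loss(aligned) < contrastive_loss(misaligned))
```

The loss is low when the diagonal entries dominate their rows and columns, which is what pulls matching image and text embeddings together and pushes non-matching ones apart.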
flan-t5-xxl
Table of contents: 0. TL;DR 1. Model Details 2. Usage 3. Uses 4. Bias, Risks, and Limitations 5. Training Details 6. Evaluation 7. Environmental Impact 8. Citation

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks, also covering more languages. As mentioned in the first few lines of the abstract:

> Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Disclaimer: Content from this model card has been written by the Hugging Face team, and parts of it were copied from the T5 model card.

- Model type: Language model
- Language(s) (NLP): English, German, French
- License: Apache 2.0
- Related Models: All FLAN-T5 Checkpoints
- Original Checkpoints: All Original FLAN-T5 Checkpoints
- Resources for more information:
  - Research paper
  - GitHub Repo
  - Hugging Face FLAN-T5 Docs (Similar to T5)

Example scripts in the original model card show how to use the model in `transformers`, including running it on a GPU using different precisions.

The authors write in the original paper's model card that:

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

The information below in this section is copied from the model's official model card:

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

> Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

> Flan-T5 has not been tested in real world applications.

> Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

The model was trained on a mixture of tasks that includes the tasks described in figure 2 of the original paper. According to the model card from the original paper:

> These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using the `t5x` codebase together with `jax`. The authors evaluated the model on various tasks covering several languages (1836 in total). For full results for FLAN-T5-XXL, see the research paper, Table 3.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
- Hours used: More information needed
- Cloud Provider: GCP
- Compute Region: More information needed
- Carbon Emitted: More information needed
gemma-2-2b
vit-base-patch16-384
siglip2-giant-opt-patch16-384
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks). An image can also be encoded on its own using the Vision Tower. For code examples, we refer to the siglip documentation. SigLIP 2 adds some clever training objectives on top of SigLIP:
1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023). Evaluation results for SigLIP 2 are reported in the paper.
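To make the zero-shot image classification use case above more concrete, here is a minimal sketch of how candidate labels might be scored once image and text embeddings are available. It relies on background knowledge not stated in this card: SigLIP-style models are typically scored with an independent sigmoid per (image, text) pair rather than a softmax over labels. The function names, the toy embeddings, and the `logit_scale` and `logit_bias` values are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(image_embedding, text_embeddings,
                     logit_scale=10.0, logit_bias=-5.0):
    """One independent probability per candidate label.

    logit_scale and logit_bias are placeholder values; in a trained model
    they would be learned parameters.
    """
    return [sigmoid(logit_scale * cosine(image_embedding, t) + logit_bias)
            for t in text_embeddings]


image_embedding = [1.0, 0.0]
text_embeddings = [[2.0, 0.0], [0.0, 1.0]]   # e.g. two candidate captions
scores = zero_shot_scores(image_embedding, text_embeddings)
print(scores[0] > scores[1])
```

Because each label is scored independently, the probabilities need not sum to 1, so an image can match several candidate labels or none of them.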
electra-small-discriminator
gemma-3-270m
mt5-large
bert_uncased_L-2_H-128_A-2
gemma-3n-E2B-it
pegasus-xsum
--- language: en tags: - summarization model-index: - name: google/pegasus-xsum results: - task: type: summarization name: Summarization dataset: name: samsum type: samsum config: samsum split: train metrics: - name: ROUGE-1 type: rouge value: 21.8096 verified: true - name: ROUGE-2 type: rouge value: 4.2525 verified: true - name: ROUGE-L type: rouge value: 17.4469 verified: true - name: ROUGE-LSUM type: rouge value: 18.8907 verified: true - name: loss type: loss value: 3.0317161083221436 verifie
gemma-2-9b-it
gemma-3-12b-pt
siglip2-so400m-patch16-256
metricx-24-hybrid-xxl-v2p6-bfloat16
This is not an officially supported Google product.
> ℹ️ For the full-precision (float32) variant of this model, see MetricX-24 (XXL).
GitHub repository: https://github.com/google-research/metricx
The repository contains the code for running inference on MetricX-24 models, a family of models for automatic evaluation of translations that were proposed in the WMT'24 Metrics Shared Task submission MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. The models were trained in T5X and then converted for use in PyTorch. There are 3 MetricX-24 models available on Hugging Face that vary in the number of parameters. Unlike the MetricX-23 models, the MetricX-24 models are all hybrid models that can do both reference-based and reference-free (also known as quality estimation, or QE) inference:
- MetricX-24-Hybrid-XXL
- MetricX-24-Hybrid-XL
- MetricX-24-Hybrid-Large
We recommend using the XXL model versions for the best agreement with human judgments of translation quality, the Large versions for best speed, and the XL for an intermediate use case. The MetricX-24 models available here are most similar to the primary submission to the WMT'24 Metrics Shared Task. They are initialized with mT5, then fine-tuned on a combination of direct assessment and MQM data from WMT'15-'22. However, we made a couple of small changes that make these models different from the WMT'24 submissions. First, the metric scores get automatically clipped at 0 and 25, to ensure they are strictly in the [0, 25] range, as due to the nature of regression models, the scores could otherwise sometimes fall outside the range. Second, we included one additional type of synthetic training examples that weren't ready in time for the official submission. These are examples of perfect translations of multi-sentence segments, generated from the MQM data from WMT'20-'22.
The purpose of this category of synthetic data is to reduce the model's bias against longer translations when the source segment and/or reference are also long. For comparison with the submissions to WMT'24 Metrics Shared Task, we provide an overview of the system- and segment-level correlation scores between the MetricX-24 scores and MQM ratings of translation quality, as calculated on the shared task's test sets:

| Model | Sys-Level SPA (en-de) | Seg-Level Acc (en-de) | Sys-Level SPA (en-es) | Seg-Level Acc (en-es) | Sys-Level SPA (ja-zh) | Seg-Level Acc (ja-zh) |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| MetricX-24-Hybrid-XXL | 0.865 | 0.543 | 0.785 | 0.685 | 0.878 | 0.541 |
| MetricX-24-Hybrid-XL | 0.884 | 0.522 | 0.806 | 0.683 | 0.859 | 0.528 |
| MetricX-24-Hybrid-Large | 0.879 | 0.511 | 0.795 | 0.686 | 0.845 | 0.514 |
| MetricX-24-Hybrid-QE-XXL | 0.884 | 0.525 | 0.789 | 0.685 | 0.863 | 0.527 |
| MetricX-24-Hybrid-QE-XL | 0.879 | 0.502 | 0.774 | 0.683 | 0.849 | 0.509 |
| MetricX-24-Hybrid-QE-Large | 0.809 | 0.490 | 0.762 | 0.684 | 0.847 | 0.508 |

Below are the above correlation scores averaged, as used in the shared task to determine the final ranking of the submissions:

| Model | Average Correlation |
| -------------------------- | ----- |
| MetricX-24-Hybrid-XXL | 0.716 |
| MetricX-24-Hybrid-XL | 0.714 |
| MetricX-24-Hybrid-Large | 0.705 |
| MetricX-24-Hybrid-QE-XXL | 0.712 |
| MetricX-24-Hybrid-QE-XL | 0.699 |
| MetricX-24-Hybrid-QE-Large | 0.683 |

NOTE: Since MetricX-24 models are hybrid models, MetricX-24-<size> and MetricX-24-QE-<size> correspond to the same model, evaluated with and without the references, respectively.

If you use MetricX-24 in your research, please cite the MetricX-24 publication, MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task.
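Two pieces of arithmetic in this card can be checked with a short sketch: the clipping of scores to the [0, 25] range, and the averaging of the six correlation columns into the single average reported per model. The function names are hypothetical; the numbers are copied from the tables above.

```python
def clip_score(raw_score, low=0.0, high=25.0):
    """Clamp a raw regression output into the [low, high] range."""
    return max(low, min(high, raw_score))


# Six correlation values per model, in the column order of the first table.
CORRELATIONS = {
    "MetricX-24-Hybrid-XXL": [0.865, 0.543, 0.785, 0.685, 0.878, 0.541],
    "MetricX-24-Hybrid-XL": [0.884, 0.522, 0.806, 0.683, 0.859, 0.528],
    "MetricX-24-Hybrid-Large": [0.879, 0.511, 0.795, 0.686, 0.845, 0.514],
    "MetricX-24-Hybrid-QE-XXL": [0.884, 0.525, 0.789, 0.685, 0.863, 0.527],
    "MetricX-24-Hybrid-QE-XL": [0.879, 0.502, 0.774, 0.683, 0.849, 0.509],
    "MetricX-24-Hybrid-QE-Large": [0.809, 0.490, 0.762, 0.684, 0.847, 0.508],
}


def average_correlation(values):
    """Unweighted mean of the correlation scores, rounded to three decimals."""
    return round(sum(values) / len(values), 3)


for name, values in CORRELATIONS.items():
    print(name, average_correlation(values))

print(clip_score(-1.2), clip_score(12.5), clip_score(30.0))
```

Under the assumption of an unweighted mean over the six columns, the computed values agree with the reported averages to three decimals.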
siglip2-base-patch16-512
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks). An image can also be encoded on its own using the Vision Tower. For code examples, we refer to the siglip documentation. SigLIP 2 adds some clever training objectives on top of SigLIP:
1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023). Evaluation results for SigLIP 2 are reported in the paper.
gemma-3-12b-it-qat-q4_0-gguf
gemma-3-27b-it-qat-q4_0-gguf
mt5-base
--- language: - multilingual - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fil - fr - fy - ga - gd - gl - gu - ha - haw - hi - hmn - ht - hu - hy - ig - is - it - iw - ja - jv - ka - kk - km - kn - ko - ku - ky - la - lb - lo - lt - lv - mg - mi - mk - ml - mn - mr - ms - mt - my - ne - nl - no - ny - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - sm - sn - so - sq - sr - st - su - sv - sw - ta - te - tg - th - tr - uk - und -
videoprism-base-f16r288
siglip-so400m-patch14-224
gemma-2b-it
--- library_name: transformers license: gemma new_version: google/gemma-2-2b-it widget: - messages: - role: user content: How does the brain work? inference: parameters: max_new_tokens: 200 extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Ackn
siglip2-base-patch16-256
siglip2-large-patch16-256
byt5-small
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of mT5. ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data; e.g., `google/byt5-small` significantly outperforms mt5-small on TweetQA.
Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models
Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
ByT5 works on raw UTF-8 bytes and can be used without a tokenizer: For batched inference & training it is, however, recommended to use a tokenizer class for padding:
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. 
As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
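To make the "raw UTF-8 bytes without a tokenizer" idea concrete, here is a minimal sketch of how text could be mapped to input ids and back. It assumes that the first three ids are reserved for special tokens, so each byte value is shifted by 3; the constant and the helper names `encode` and `decode` are illustrative and not part of any library.

```python
# Assumption: ids 0, 1 and 2 are reserved for special tokens (padding,
# end-of-sequence, unknown), so byte value b is represented by id b + 3.
NUM_SPECIAL_TOKENS = 3
NUM_BYTE_VALUES = 256


def encode(text):
    """Map a string to a list of ids, one id per UTF-8 byte."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")]


def decode(ids):
    """Map ids back to text, skipping anything outside the byte range."""
    raw = bytes(
        i - NUM_SPECIAL_TOKENS
        for i in ids
        if NUM_SPECIAL_TOKENS <= i < NUM_SPECIAL_TOKENS + NUM_BYTE_VALUES
    )
    return raw.decode("utf-8", errors="ignore")


ids = encode("Life is like a box of chocolates.")
print(len(ids), decode(ids))
```

Note that non-ASCII characters occupy several bytes, so the id sequence is generally longer than the character count.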
gemma-3-4b-pt
vivit-b-16x2-kinetics400
ViViT model as introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository. Disclaimer: The team releasing ViViT did not write a model card for this model so this model card has been written by the Hugging Face team. ViViT is an extension of the Vision Transformer (ViT) to video. The model is mostly intended to be fine-tuned on a downstream task, like video classification. See the model hub to look for fine-tuned versions on a task that interests you.
gemma-3-1b-pt
vit-large-patch16-224
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at the same resolution, 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: Currently, both the feature extractor and model support PyTorch. Tensorflow and JAX/FLAX are coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.
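The normalization described above (mean 0.5 and standard deviation 0.5 per RGB channel) can be sketched as follows. The sketch assumes pixel values arrive as 8-bit integers and are first rescaled to [0, 1] by dividing by 255; that rescaling factor and the helper name `normalize_pixel` are our assumptions.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb):
    """Rescale an 8-bit RGB triple to [0, 1], then normalize each channel."""
    return tuple(((c / 255.0) - m) / s for c, m, s in zip(rgb, MEAN, STD))


# With mean 0.5 and std 0.5, the [0, 1] range is mapped onto [-1, 1].
print(normalize_pixel((0, 0, 0)))
print(normalize_pixel((255, 255, 255)))
```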
mobilenet_v2_1.0_224
siglip2-so400m-patch16-512
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You...
siglip2-so400m-patch16-384
byt5-base
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of mT5. ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data; e.g., `google/byt5-base` significantly outperforms mt5-base on TweetQA.
Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models
Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
ByT5 works on raw UTF-8 bytes and can be used without a tokenizer: For batched inference & training it is, however, recommended to use a tokenizer class for padding:
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. 
As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
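The padding step that the card recommends for batched inference and training can be illustrated without any library. The sketch below assumes that id 0 is the padding token and that byte values are shifted by 3; for brevity it leaves out the end-of-sequence token a real tokenizer class would typically append. The helper names are hypothetical.

```python
PAD_ID = 0       # assumption: id 0 is reserved for padding
BYTE_OFFSET = 3  # assumption: byte value b is represented by id b + 3


def encode(text):
    """One id per UTF-8 byte, shifted past the reserved ids."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def pad_batch(texts):
    """Encode each text and right-pad to the longest sequence in the batch.

    Returns (input_ids, attention_mask); the mask is 1 for real bytes
    and 0 for padding positions.
    """
    encoded = [encode(t) for t in texts]
    max_len = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in encoded]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded
    ]
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch(["Life is like a box.", "Today is Monday"])
print(len(batch_ids[0]), len(batch_ids[1]), batch_mask[1][-4:])
```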
siglip2-so400m-patch14-224
t5-v1_1-xl
T5 Version 1.1 includes the following improvements compared to the original T5 model:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see here.
- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
- Pre-trained on C4 only, without mixing in the downstream tasks.
- No parameter sharing between the embedding and classifier layers.
- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different: larger `d_model` and smaller `num_heads` and `d_ff`.
Note: T5 Version 1.1 was only pre-trained on C4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
Pretraining Dataset: C4
Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
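The GEGLU activation mentioned in the list of T5 Version 1.1 improvements can be sketched as a gated feed-forward layer: one projection passes through GELU and multiplies a second, linear projection element-wise before the output projection. The version below is a plain-Python illustration under our own assumptions (exact erf-based GELU, no bias terms, hypothetical weight names); released checkpoints may use a different GELU approximation.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(x, w):
    """Compute x @ w for a vector x (len d_in) and a d_in x d_out matrix w."""
    return [
        sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))
    ]


def geglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward layer: (GELU(x W_gate) * (x W_up)) W_down."""
    gate = [gelu(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(geglu_ffn([1.0, 2.0], identity, identity, identity))
```

Compared with a ReLU feed-forward layer, the gated variant needs two input projections instead of one, which is one reason the layer shapes differ from the original T5.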
gemma-3n-E4B-it
long-t5-tglobal-base
LongT5 (transient-global attention, base-sized model)
LongT5 model pre-trained on the English language. The model was introduced in the paper LongT5: Efficient Text-To-Text Transformer for Long Sequences by Guo et al. and first released in the LongT5 repository. The model architecture and configurations can be found in the Flaxformer repository, which uses another Google research project repository, T5x. Disclaimer: The team releasing LongT5 did not write a model card for this model so this model card has been written by the Hugging Face team.
Model description
LongT5 is an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting (Pegasus-like generation pre-training). LongT5 is an extension of the T5 model that enables the use of one of two efficient attention mechanisms: (1) Local attention, or (2) Transient-Global attention. These attention sparsity patterns allow the model to handle long input sequences efficiently. LongT5 is particularly effective when fine-tuned for text generation (summarization, question answering) which requires handling long input sequences (up to 16,384 tokens). The model is mostly meant to be fine-tuned on a supervised dataset. See the model hub to look for fine-tuned versions on a task that interests you.
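To illustrate why a sparse attention pattern helps with long inputs, the sketch below builds a simple local attention mask in which each token may only attend to neighbours within a fixed radius. The radius value and function names are illustrative assumptions; the sketch covers only the local pattern and does not model the additional global tokens used by Transient-Global attention.

```python
def local_attention_mask(seq_len, radius):
    """mask[i][j] is True when token i may attend to token j."""
    return [
        [abs(i - j) <= radius for j in range(seq_len)] for i in range(seq_len)
    ]


def count_allowed_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)


mask = local_attention_mask(seq_len=8, radius=2)
print(count_allowed_pairs(mask), "of", 8 * 8, "pairs")
```

With a fixed radius, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is the property that makes long inputs tractable.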
byt5-large
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of mT5. ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data; e.g., `google/byt5-large` significantly outperforms mt5-large on TweetQA.
Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models
Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
ByT5 works on raw UTF-8 bytes and can be used without a tokenizer: For batched inference & training it is, however, recommended to use a tokenizer class for padding:
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. 
As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
bigbird-roberta-base
paligemma-3b-pt-224
owlv2-base-patch16
The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. 
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
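The open-vocabulary classification step described in this card, where fixed classifier weights are replaced by class-name embeddings from the text model, can be sketched as a similarity lookup. The example assumes plain cosine similarity between each image-token embedding and each text-query embedding; any learned projection, scaling, or bias applied by the real model is omitted, and all names and vectors are made up for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def classify_tokens(token_embeddings, query_embeddings, query_names):
    """For each image token, return the best-matching query and its score."""
    results = []
    for token in token_embeddings:
        scores = [cosine(token, q) for q in query_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((query_names[best], scores[best]))
    return results


tokens = [[1.0, 0.0], [0.0, 2.0]]
queries = [[3.0, 0.0], [0.0, 1.0]]
print(classify_tokens(tokens, queries, ["a photo of a cat", "a photo of a dog"]))
```

Because the "classifier" is just a set of text embeddings, new categories can be queried at inference time without retraining the detection heads.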
gemma-3n-E2B-it-litert-lm
gemma-2-9b
owlvit-base-patch16
The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
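The bipartite matching loss mentioned in this card relies on a one-to-one assignment between predicted and ground-truth boxes. The following is a minimal sketch of that assignment step only, under strong simplifying assumptions: the cost is a plain L1 distance between box coordinates (the real objective also involves class predictions), and the search is brute force over permutations, which is practical only for tiny examples. All names are hypothetical.

```python
from itertools import permutations


def l1_cost(box_a, box_b):
    """L1 distance between two boxes given as coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match(pred_boxes, gt_boxes):
    """Find the one-to-one assignment with the lowest total cost.

    Returns (assignment, cost), where assignment[g] is the index of the
    prediction matched to ground-truth box g. Requires at least as many
    predictions as ground-truth boxes.
    """
    best = None
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm)
        )
        if best is None or cost < best[1]:
            best = (perm, cost)
    return best


preds = [(0.9, 0.9, 0.1, 0.1), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
targets = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
print(match(preds, targets))
```

Predictions left unmatched (index 0 here) would be treated as background. Practical implementations replace the brute-force search with a polynomial-time assignment solver.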
gemma-7b
tapas-base-finetuned-wtq
TAPAS base model fine-tuned on WikiTable Questions (WTQ)
This model has 2 versions which can be used. The default version corresponds to the `tapas_wtq_wikisql_sqa_inter_masklm_base_reset` checkpoint of the original GitHub repository. This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned in a chain on SQA, WikiSQL and finally WTQ. It uses relative position embeddings (i.e. resetting the position index at every cell of the table). The other (non-default) version which can be used is:
- `no_reset`, which corresponds to `tapas_wtq_wikisql_sqa_inter_masklm_base` (intermediate pre-training, absolute position embeddings).
Disclaimer: The team releasing TAPAS did not write a model card for this model so this model card has been written by the Hugging Face team and contributors.

Size | Reset | Dev Accuracy | Link
-------- | --------| -------- | ----
LARGE | noreset | 0.5062 | tapas-large-finetuned-wtq (with absolute pos embeddings)
LARGE | reset | 0.5097 | tapas-large-finetuned-wtq
BASE | noreset | 0.4525 | tapas-base-finetuned-wtq (with absolute pos embeddings)
BASE | reset | 0.4638 | tapas-base-finetuned-wtq
MEDIUM | noreset | 0.4324 | tapas-medium-finetuned-wtq (with absolute pos embeddings)
MEDIUM | reset | 0.4324 | tapas-medium-finetuned-wtq
SMALL | noreset | 0.3681 | tapas-small-finetuned-wtq (with absolute pos embeddings)
SMALL | reset | 0.3762 | tapas-small-finetuned-wtq
MINI | noreset | 0.2783 | tapas-mini-finetuned-wtq (with absolute pos embeddings)
MINI | reset | 0.2854 | tapas-mini-finetuned-wtq
TINY | noreset | 0.0823 | tapas-tiny-finetuned-wtq (with absolute pos embeddings)
TINY | reset | 0.1039 | tapas-tiny-finetuned-wtq

TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. 
This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:
- Masked language modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and associated text.
- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model by creating a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.
This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head and aggregation head on top of the pre-trained model, and then jointly training these randomly initialized classification heads with the base model on SQA, WikiSQL and finally WTQ. You can use this model for answering questions related to a table. For code examples, we refer to the documentation of TAPAS on the HuggingFace website. 
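The masked language modeling objective described above can be sketched in a few lines. This is a simplified illustration, not the procedure used to train TAPAS: it masks a fixed 15% of positions chosen at random and always substitutes a `[MASK]` placeholder, whereas the exact sampling and replacement rules are not given in this card. The token list and helper name are made up.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


# A flattened table plus question, whitespace-tokenized for illustration.
example = "how many goals did she score ? name goals team anna 12 red".split()
masked_example, targets = mask_tokens(example)
print(masked_example)
print(targets)
```

Because every unmasked token on both sides of a masked position stays visible, the model can use the full table and question context to fill the gap, which is what makes the learned representation bidirectional.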
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: The authors first converted the WTQ dataset into the format of SQA using automatic conversion scripts. The model was fine-tuned on 32 Cloud TPU v3 cores for 50,000 steps with maximum sequence length 512 and batch size of 512. In this setup, fine-tuning takes around 10 hours. The optimizer used is Adam with a learning rate of 1.93581e-5, and a warmup ratio of 0.128960. An inductive bias is added such that the model only selects cells of the same column. This is reflected by the `select_one_column` parameter of `TapasConfig`. See the paper for more details (tables 11 and 12).
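The fine-tuning hyperparameters above imply a concrete number of warmup steps, which the sketch below works out. It assumes the warmup ratio is a fraction of the total step count, that the learning rate rises linearly during warmup, and that it then decays linearly to zero; the schedule shape is our assumption, as the card only states the ratio and the peak learning rate.

```python
TOTAL_STEPS = 50_000
WARMUP_RATIO = 0.128960
PEAK_LR = 1.93581e-5

# round() rather than int() guards against floating-point truncation.
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(WARMUP_STEPS)
for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
    print(step, learning_rate(step))
```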
paligemma2-28b-pt-896
siglip2-base-patch32-256
owlvit-large-patch14
The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
siglip2-large-patch16-384
medgemma-27b-text-it
siglip-large-patch16-384
siglip-large-patch16-256
siglip-base-patch16-256
paligemma2-3b-ft-docci-448
paligemma2-10b-pt-224
t5gemma-9b-9b-ul2
vit-large-patch16-224-in21k
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model to embed images, but it's mostly intended to be fine-tuned on a downstream task. Currently, both the feature extractor and model support PyTorch. 
Tensorflow and JAX/FLAX are coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.
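The gradient clipping at global norm 1 mentioned above can be sketched as follows. The idea is to compute one norm over all parameter gradients together and, if it exceeds the threshold, scale every gradient by the same factor. Gradients are represented here as plain lists of floats for illustration; the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients jointly so their combined L2 norm is <= max_norm.

    grads is a list of flat lists, one per parameter tensor.
    Returns (clipped_grads, original_global_norm).
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(tensor) for tensor in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Scaling all tensors by a single factor preserves the direction of the overall update, unlike clipping each tensor or each value independently.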
efficientnet-b0
siglip2-base-patch16-384
vit-large-patch16-384
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax support is coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size generally results in better performance.
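The normalization step mentioned above (mean 0.5, standard deviation 0.5 per RGB channel) can be sketched in plain Python. The helper names are hypothetical, and the initial rescaling of 8-bit values by 1/255 is an assumption about what "rescaled" means here, not something the card states.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_channel(value: int, mean: float, std: float) -> float:
    """Rescale an 8-bit channel value to [0, 1] (assumed), then normalize."""
    rescaled = value / 255.0
    return (rescaled - mean) / std


def normalize_rgb(pixel):
    """Normalize one (R, G, B) pixel channel by channel."""
    return tuple(normalize_channel(v, m, s) for v, m, s in zip(pixel, MEAN, STD))


# Black, mid-grey and white channel values.
print(normalize_rgb((0, 128, 255)))
```

Under these assumptions the normalized values fall in the range [-1, 1], with 0 mapping to -1 and 255 mapping to 1.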
pegasus-large
t5gemma-b-b-ul2
siglip2-large-patch16-512
videoprism-lvt-base-f16r288
t5-efficient-tiny
T5-Efficient-TINY is a variation of Google's original T5 following the T5 model architecture. It is a pretrained-only checkpoint and was released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. In a nutshell, the paper indicates that a Deep-Narrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count.

> We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased
> before considering any other forms of uniform scaling across other dimensions. This is largely due to
> how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a
> tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise,
> a tall base model might also generally more efficient compared to a large model. We generally find
> that, regardless of size, even if absolute performance might increase as we continue to stack layers,
> the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36
> layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e.,
> params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params,
> FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to
> consider.

To be more precise, model depth is defined as the number of transformer blocks that are stacked sequentially. A sequence of word embeddings is therefore processed sequentially by each transformer block.

This model checkpoint - t5-efficient-tiny - is of model type Tiny with no variations. It has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16).

A summary of the original T5 model architectures can be seen here:

| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M |
| Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M |
| Small | 6/6 | 2048 | 512 | 32 | 8 | 60M |
| Base | 12/12 | 3072 | 768 | 64 | 12 | 220M |
| Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M |
| Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B |
| XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B |

| Abbreviation | Definition |
| ---- | ---- |
| nl | Number of transformer blocks (depth) |
| dm | Dimension of embedding vector (output vector of transformers block) |
| kv | Dimension of key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key-values projection matrices are tied |

If a model checkpoint has no specific el or dl, then both the number of encoder layers and the number of decoder layers correspond to nl.

The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.

Note: This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks. You can follow one of the following examples on how to fine-tune the model:

- Summarization
- Question Answering
- Text Classification - Note: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.

We strongly recommend the reader to go carefully through the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers, as they are probably of limited practical use and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
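The memory figures quoted for this checkpoint follow directly from the parameter count. The short sketch below reproduces that arithmetic; the helper name `weight_memory_mb` is hypothetical, and it assumes 1 MB = 1,000,000 bytes and counts only the weights (no activations or optimizer state).

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Memory needed for the weights alone, in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


NUM_PARAMS = 15_580_000  # 15.58 million parameters, as stated for t5-efficient-tiny

for dtype in ("fp32", "fp16", "bf16"):
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.2f} MB")
```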
siglip-base-patch16-256-multilingual
umt5-xxl
UMT5 is pretrained on an updated version of the mC4 corpus, covering 107 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu. Note: UMT5 was only pre-trained on mC4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task. Paper: UniMax, Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. Authors: Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. Abstract: Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
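To illustrate the idea of uniform coverage with a cap on repeats, here is a minimal sketch of one plausible reading of the description above. It is not taken from the paper's code: the function name, the greedy smallest-first allocation, and the example numbers are all assumptions made for illustration.

```python
def unimax_distribution(char_counts, total_budget, max_epochs):
    """Split a character budget as uniformly as possible across languages,
    never giving a language more than max_epochs passes over its corpus."""
    if not char_counts:
        raise ValueError("char_counts must not be empty")
    budgets = {}
    remaining = float(total_budget)
    # Visit the smallest corpora first so their unused share flows to larger ones.
    languages = sorted(char_counts, key=char_counts.get)
    for i, lang in enumerate(languages):
        uniform_share = remaining / (len(languages) - i)
        cap = char_counts[lang] * max_epochs
        budgets[lang] = min(uniform_share, cap)
        remaining -= budgets[lang]
    total = sum(budgets.values())
    return {lang: budget / total for lang, budget in budgets.items()}


# Hypothetical corpus sizes (in characters) for a tail, mid and head language.
probs = unimax_distribution({"tail": 10, "mid": 100, "head": 1000},
                            total_budget=300, max_epochs=4)
print(probs)
```

In this toy setting the tail language is capped at four passes over its 10 characters, and the budget it cannot use is shared equally between the two larger languages.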
bert_uncased_L-4_H-256_A-4
This is the set of 24 BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (English only, uncased, trained with WordPiece masking). We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity. You can download the 24 BERT miniatures either from the official BERT GitHub page, or via HuggingFace from the links below:

| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| L=2 |[2/128 (BERT-Tiny)][2128]|[2/256][2256]|[2/512][2512]|[2/768][2768]|
| L=4 |[4/128][4128]|[4/256 (BERT-Mini)][4256]|[4/512 (BERT-Small)][4512]|[4/768][4768]|
| L=6 |[6/128][6128]|[6/256][6256]|[6/512][6512]|[6/768][6768]|
| L=8 |[8/128][8128]|[8/256][8256]|[8/512 (BERT-Medium)][8512]|[8/768][8768]|
| L=10 |[10/128][10128]|[10/256][10256]|[10/512][10512]|[10/768][10768]|
| L=12 |[12/128][12128]|[12/256][12256]|[12/512][12512]|[12/768 (BERT-Base)][12768]|

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model. Here are the corresponding GLUE scores on the test set:

|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:

- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5

If you use these models, please cite the paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.

[2128]: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
[2256]: https://huggingface.co/google/bert_uncased_L-2_H-256_A-4
[2512]: https://huggingface.co/google/bert_uncased_L-2_H-512_A-8
[2768]: https://huggingface.co/google/bert_uncased_L-2_H-768_A-12
[4128]: https://huggingface.co/google/bert_uncased_L-4_H-128_A-2
[4256]: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
[4512]: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8
[4768]: https://huggingface.co/google/bert_uncased_L-4_H-768_A-12
[6128]: https://huggingface.co/google/bert_uncased_L-6_H-128_A-2
[6256]: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4
[6512]: https://huggingface.co/google/bert_uncased_L-6_H-512_A-8
[6768]: https://huggingface.co/google/bert_uncased_L-6_H-768_A-12
[8128]: https://huggingface.co/google/bert_uncased_L-8_H-128_A-2
[8256]: https://huggingface.co/google/bert_uncased_L-8_H-256_A-4
[8512]: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8
[8768]: https://huggingface.co/google/bert_uncased_L-8_H-768_A-12
[10128]: https://huggingface.co/google/bert_uncased_L-10_H-128_A-2
[10256]: https://huggingface.co/google/bert_uncased_L-10_H-256_A-4
[10512]: https://huggingface.co/google/bert_uncased_L-10_H-512_A-8
[10768]: https://huggingface.co/google/bert_uncased_L-10_H-768_A-12
[12128]: https://huggingface.co/google/bert_uncased_L-12_H-128_A-2
[12256]: https://huggingface.co/google/bert_uncased_L-12_H-256_A-4
[12512]: https://huggingface.co/google/bert_uncased_L-12_H-512_A-8
[12768]: https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
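The 24 miniatures follow a regular naming scheme of the form `bert_uncased_L-4_H-256_A-4` (layers, hidden size, attention heads). The sketch below enumerates all of them; the rule that the number of heads equals the hidden size divided by 64 is a pattern inferred from the listed names rather than something the card states, and the helper name is hypothetical.

```python
LAYERS = (2, 4, 6, 8, 10, 12)
HIDDEN_SIZES = (128, 256, 512, 768)


def miniature_name(num_layers: int, hidden_size: int) -> str:
    """Build a checkpoint name such as bert_uncased_L-4_H-256_A-4."""
    # Inferred pattern: one attention head per 64 hidden units.
    num_heads = hidden_size // 64
    return f"bert_uncased_L-{num_layers}_H-{hidden_size}_A-{num_heads}"


names = [miniature_name(layers, hidden) for layers in LAYERS for hidden in HIDDEN_SIZES]
urls = [f"https://huggingface.co/google/{name}" for name in names]

print(len(names))
print(urls[0])
```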
vit-base-patch32-224-in21k
gemma-2-2b-jpn-it
medsiglip-448
metricx-24-hybrid-xl-v2p6
siglip-base-patch16-384
bert_uncased_L-12_H-512_A-8
medgemma-27b-it
electra-small-generator
bigbird-pegasus-large-arxiv
vit-large-patch32-384
bert_uncased_L-4_H-512_A-8
vit-huge-patch14-224-in21k
gemma-3n-E4B-it-litert-lm
muril-base-cased
timesfm-2.0-500m-pytorch
mobilenet_v1_0.75_192
bert_uncased_L-2_H-512_A-8
vit-base-patch32-384
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who had already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model, so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained in a supervised fashion on a large collection of images, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384. Images are presented to the model as a sequence of fixed-size patches (resolution 32x32), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder. Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax support is coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size generally results in better performance.
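Gradient clipping at global norm 1, as mentioned in the training details above, can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the authors' training code; the function name and the representation of gradients as nested lists of floats are assumptions made to keep the example dependency-free.

```python
import math


def clip_by_global_norm(gradients, max_norm=1.0):
    """Scale all gradients jointly so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Two hypothetical parameter tensors whose joint norm is 5.
clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm, clipped)
```

The norm is computed over all parameters together, so every gradient is multiplied by the same factor and the update direction is preserved.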
t5-v1_1-small
siglip-base-patch16-512
efficientnet-b7
EfficientNet model trained on ImageNet-1k at resolution 600x600. It was introduced in the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks by Mingxing Tan and Quoc V. Le, and first released in this repository. Disclaimer: The team releasing EfficientNet did not write a model card for this model, so this model card has been written by the Hugging Face team. EfficientNet is a mobile-friendly, pure convolutional model (ConvNet) that proposes a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. For code examples, we refer to the documentation.
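The compound scaling idea can be illustrated with a short sketch. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values reported in the EfficientNet paper and are not stated in this card, so treat them as an assumption; the function name is hypothetical, and the exact settings of the released checkpoints may differ from this idealized formula.

```python
# Base multipliers per unit of the compound coefficient (assumed from the paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float) -> dict:
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


for phi in (0, 1, 2):
    print(phi, compound_scale(phi))

# The coefficients are chosen so that FLOPs grow by roughly 2x per unit of phi.
print(ALPHA * BETA ** 2 * GAMMA ** 2)
```

A single coefficient thus controls all three dimensions at once, rather than tuning depth, width and resolution independently.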
paligemma2-3b-mix-224
gemma-3-27b-pt
umt5-xl
pegasus-cnn_dailymail
shieldgemma-2-4b-it
ddpm-ema-celebahq-256
ddpm-celebahq-256
t5gemma-2b-2b-ul2
gemma-3n-E4B
electra-large-discriminator
t5gemma-2b-2b-prefixlm-it
gemma-3-4b-it-qat-q4_0-unquantized
muril-large-cased
paligemma2-3b-mix-448
t5-v1_1-large
txgemma-27b-predict
pix2struct-base
bert_uncased_L-12_H-768_A-12
canine-c
paligemma-3b-mix-448
tapas-large-finetuned-wtq
bert_uncased_L-8_H-512_A-8
efficientnet-b1
codegemma-2b
shieldgemma-2b
siglip2-giant-opt-patch16-256
canine-s
txgemma-2b-predict
owlv2-large-patch14
t5-efficient-small
vaultgemma-1b
codegemma-7b
---
library_name: transformers
license: gemma
license_link: https://ai.google.dev/gemma/terms
extra_gated_heading: Access CodeGemma on Hugging Face
extra_gated_prompt: To access CodeGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
shieldgemma-9b
tapas-base
pix2struct-textcaps-base
paligemma2-3b-pt-448
deplot
vit-large-patch32-224-in21k
gemma-3-4b-it-qat-q4_0-gguf
bert_uncased_L-4_H-128_A-2
madlad400-10b-mt
efficientnet-b2
owlv2-base-patch16-finetuned
medgemma-4b-pt
timesfm-1.0-200m-pytorch
t5_xxl_true_nli_mixture
This is an NLI model based on T5-XXL that predicts a binary label ('1' - Entailment, '0' - No entailment). It is trained similarly to the NLI model described in the TRUE paper (Honovich et al., 2022), but using the following datasets instead of ANLI:

- SNLI (Bowman et al., 2015)
- MNLI (Williams et al., 2018)
- Fever (Thorne et al., 2018)
- Scitail (Khot et al., 2018)
- PAWS (Zhang et al., 2019)
- VitaminC (Schuster et al., 2021)

The input format for the model is: "premise: PREMISE_TEXT hypothesis: HYPOTHESIS_TEXT". If you use this model for a research publication, please cite the TRUE paper and the dataset papers mentioned above.
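The input format and label convention described above can be captured in two small helpers. This is a minimal sketch: the function names are hypothetical, and it only builds the prompt string and interprets the predicted label, without loading the model.

```python
def format_nli_input(premise: str, hypothesis: str) -> str:
    """Build the single input string the model expects."""
    return f"premise: {premise} hypothesis: {hypothesis}"


def label_to_verdict(label: str) -> str:
    """Map the model's binary output label to a readable verdict."""
    verdicts = {"1": "Entailment", "0": "No entailment"}
    return verdicts[label.strip()]


text = format_nli_input("A man is playing a guitar on stage.",
                        "Someone is making music.")
print(text)
print(label_to_verdict("1"))
```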
pix2struct-docvqa-base
embeddinggemma-300m-qat-q8_0-unquantized
bit-50
The BiT model was proposed in Big Transfer (BiT): General Visual Representation Learning by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning. Disclaimer: The team releasing BiT did not write a model card for this model, so this model card has been written by the Hugging Face team. Abstract: Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. For code examples, we refer to the documentation.
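As a rough sketch of how such a classification call might look with the Transformers Auto classes: the helper names `top_k_labels` and `classify_image` are hypothetical, the hub ID `google/bit-50` is assumed from this listing, and running `classify_image` requires `torch`, `transformers` and `Pillow` plus a download of the checkpoint. It is an illustration under those assumptions, not the official usage snippet.

```python
def top_k_labels(logits, id2label, k=5):
    """Return the k highest-scoring (label, score) pairs from a list of logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(id2label[i], logits[i]) for i in ranked[:k]]


def classify_image(image_path, checkpoint="google/bit-50", k=5):
    """Classify a local image file into ImageNet classes (assumed hub ID)."""
    # Heavy dependencies are imported lazily so top_k_labels works without them.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k_labels(logits, model.config.id2label, k)


# Pure-Python demonstration of the ranking step with made-up scores.
print(top_k_labels([0.1, 2.0, -1.0], {0: "cat", 1: "dog", 2: "car"}, k=2))
```

Calling `classify_image("path/to/image.jpg")` would then return the top predicted ImageNet labels with their logits; the path is a placeholder for any local image.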
bert_uncased_L-6_H-768_A-12
bigbird-roberta-large
codegemma-7b-it
rembert
bert_uncased_L-12_H-256_A-4
roberta2roberta_L-24_discofuse
gemma-3-1b-it-qat-q4_0-gguf
metricx-24-hybrid-large-v2p6
gemma-3n-E2B
tapas-base-finetuned-tabfact
reformer-crime-and-punishment
paligemma2-10b-mix-448
efficientnet-b4
bert_for_seq_generation_L-24_bbc_encoder
t5gemma-2b-2b-ul2-it
long-t5-tglobal-large
metricx-23-xl-v2p0
gemma-3-1b-it-qat-q4_0-unquantized
siglip-so400m-patch16-256-i18n
tapas-base-finetuned-sqa
long-t5-local-base
t5gemma-b-b-prefixlm-it
pix2struct-ai2d-base
ddpm-ema-church-256
mobilenet_v1_1.0_224
long-t5-tglobal-xl
gemma-3-12b-it-qat-int4-unquantized
deeplabv3_mobilenet_v2_1.0_513
videoprism-lvt-large-f8r288
ddpm-cat-256
t5gemma-9b-2b-ul2-it
metricx-24-hybrid-large-v2p6-bfloat16
efficientnet-b3
gemma-3-27b-it-qat-q4_0-unquantized
ddpm-church-256
switch-base-16
tapas-tiny-finetuned-sqa
pix2struct-large
efficientnet-b5
t5gemma-s-s-ul2
pegasus-x-base
bigbird-pegasus-large-pubmed
long-t5-local-large
paligemma-3b-pt-448
paligemma2-10b-mix-224
reformer-enwik8
mobilenet_v2_1.4_224
gemma-3-12b-it-qat-q4_0-unquantized
embeddinggemma-300m-qat-q4_0-unquantized
bigbird-base-trivia-itc
metricx-24-hybrid-xl-v2p6-bfloat16
gemma-3-270m-it-qat-q4_0-unquantized
bert_uncased_L-2_H-256_A-4
DiarizationLM-8b-Fisher-v2
paligemma2-3b-pt-896
t5gemma-l-l-ul2-it
t5-xl-lm-adapt
gemma-3-270m-qat-q4_0-unquantized
bert2bert_L-24_wmt_de_en
t5gemma-b-b-prefixlm
gemma-3-4b-it-qat-int4-unquantized
gemma-7b-aps-it
bert_uncased_L-8_H-256_A-4
vivit-b-16x2
txgemma-9b-chat
roberta2roberta_L-24_bbc
paligemma2-10b-ft-docci-448
hear-pytorch
tapas-tiny-finetuned-wtq
txgemma-27b-chat
madlad400-7b-mt
txgemma-9b-predict
paligemma2-28b-mix-224
tapas-large
paligemma-3b-ft-gqa-224
pegasus-multi_news
electra-base-generator
bert_uncased_L-12_H-128_A-2
t5-small-ssm-nq
paligemma-3b-ft-nlvr2-448
mobilenet_v2_0.35_96
bigbird-pegasus-large-bigpatent
tapas-base-finetuned-wikisql-supervised
datagemma-rig-27b-it
tapas-small-finetuned-sqa
t5gemma-2b-2b-prefixlm
metricx-23-qe-xl-v2p0
t5-large-lm-adapt
t5gemma-ml-ml-ul2-it
pegasus-pubmed
owlv2-large-patch14-finetuned
gemma-3-1b-it-qat-int4-unquantized
t5-small-lm-adapt
pegasus-x-base-arxiv
pegasus-arxiv
pix2struct-chartqa-base
umt5-small
mobilenet_v2_0.75_160
timesfm-1.0-200m
t5-base-lm-adapt
t5gemma-9b-9b-prefixlm-it
matcha-chartqa
t5gemma-s-s-ul2-it
efficientnet-b6
t5gemma-9b-9b-ul2-it
t5-xxl-lm-adapt
t5gemma-xl-xl-prefixlm-it
derm-foundation
pix2struct-docvqa-large
bert_uncased_L-6_H-256_A-4
metricx-23-large-v2p0
videoprism-large-f8r288
metricx-24-hybrid-xxl-v2p6
paligemma-3b-ft-cococap-448
magenta-realtime
- Blog Post - Paper - Colab Demo - Repository - HuggingFace

Magenta RealTime is offered under a combination of licenses: the codebase is licensed under Apache 2.0, and the model weights under Creative Commons Attribution 4.0 International. In addition, we specify the following usage terms: Use these materials responsibly and do not generate content, including outputs, that infringe or violate the rights of others, including rights in copyrighted content. Google claims no rights in outputs you generate using Magenta RealTime. You and your users are solely responsible for outputs and their subsequent uses. Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses. You are solely responsible for determining the appropriateness of using, reproducing, modifying, performing, displaying or distributing the software and materials, and any outputs, and assume any and all risks associated with your use or distribution of any of the software and materials, and any outputs, and your exercise of rights and permissions under the licenses.

Magenta RealTime is an open music generation model from Google built from the same research and technology used to create MusicFX DJ and Lyria RealTime. Magenta RealTime enables the continuous generation of musical audio steered by a text prompt, an audio example, or a weighted combination of multiple text prompts and/or audio examples. Its relatively small size makes it possible to deploy in environments with limited resources, including live performance settings or freely available Colab TPUs.

Magenta RealTime is composed of three components: SpectroStream, MusicCoCa, and an LLM. A full technical report with more details on each component is here.

1. SpectroStream is a discrete audio codec that converts stereo 48kHz audio into tokens, building on the SoundStream RVQ codec from Zeghidour+ 21.
2. MusicCoCa is a contrastive-trained model capable of embedding audio and text into a common embedding space, building on Yu+ 22 and Huang+ 22.
3. An encoder-decoder Transformer LLM generates audio tokens given context audio tokens and a tokenized MusicCoCa embedding, building on the MusicLM method from Agostinelli+ 23.

- SpectroStream RVQ codec: Tokenizes high-fidelity music audio
  - Encoder input / Decoder output: Music audio waveforms, 48kHz stereo
  - Encoder output / Decoder input: Discrete audio tokens, 25Hz frame rate, 64 RVQ depth, 10 bit codes, 16kbps
- MusicCoCa: Joint embeddings of text and music audio
  - Input: Music audio waveforms, 16kHz mono, or text representation of music style, e.g. "heavy metal"
  - Output: 768 dimensional embedding, quantized to 12 RVQ depth, 10 bit codes
- Encoder-decoder Transformer LLM: Generates audio tokens given context and style
  - Encoder Input: (Context, 1000 tokens) 10s of audio context tokens w/ 4 RVQ depth, (Style, 6 tokens) Quantized MusicCoCa style embedding
  - Decoder Output: (Generated, 800 tokens) 2s of audio w/ 16 RVQ depth

Music generation models, in particular ones targeted for continuous real-time generation and control, have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

- Interactive Music Creation
  - Live Performance / Improvisation: These models can be used to generate music in a live performance setting, controlled by performers manipulating style embeddings or the audio context.
  - Accessible Music-Making & Music Therapy: People with impediments to using traditional instruments (skill gaps, disabilities, etc.) can participate in communal jam sessions or solo music creation.
  - Video Games: Developers can create a custom soundtrack for users in real-time based on their actions and environment.
- Research
  - Transfer learning: Researchers can leverage representations from MusicCoCa and Magenta RT to recognize musical information.
- Personalization
  - Musicians can fine-tune models with their own catalog to customize the model to their style (fine-tuning support coming soon).
- Education
  - Exploring Genres, Instruments, and History: Natural language prompting enables users to quickly learn about and experiment with musical concepts.

See our Terms of Use above for usage we consider out of scope.

Magenta RT supports the real-time generation and steering of instrumental music. The purpose and intention of this capability is to foster the development of new real-time, interactive co-creation workflows that seamlessly integrate with human-centered forms of musical creativity. Every AI music generation model, including Magenta RT, carries a risk of impacting the economic and cultural landscape of music. We aim to mitigate these risks through the following avenues:

- Prioritizing human-AI interaction as fundamental in the design of Magenta RT.
- Distributing the model under terms of service that prohibit developers from generating outputs that infringe or violate the rights of others, including rights in copyrighted content.
- Training on primarily instrumental data. With specific prompting, this model has been observed to generate some vocal sounds and effects, though those vocal sounds and effects tend to be non-lexical.

Coverage of broad musical styles. Magenta RT's training data primarily consists of Western instrumental music. As a consequence, Magenta RT has incomplete coverage of both vocal performance and the broader landscape of rich musical traditions worldwide. For real-time generation with broader style coverage, we refer users to our Lyria RealTime API.

Vocals. While the model is capable of generating non-lexical vocalizations and humming, it is not conditioned on lyrics and is unlikely to generate actual words. However, there remains some risk of generating explicit or culturally-insensitive lyrical content.

Latency. Because the Magenta RT LLM operates on two-second chunks, user inputs for the style prompt may take two or more seconds to influence the musical output.

Limited context. Because the Magenta RT encoder has a maximum audio context window of ten seconds, the model is unable to directly reference music that was output earlier than that. While the context is sufficient to enable the model to create melodies, rhythms, and chord progressions, the model is not capable of automatically creating longer-term song structures.

At the time of release, Magenta RealTime represents the only open-weights model supporting real-time, continuous musical audio generation. It is designed specifically to enable live, interactive musical creation, bringing new capabilities to musical performances, art installations, video games, and many other applications. See our Colab demo and GitHub repository for usage examples.

Magenta RealTime was trained on ~190k hours of stock music from multiple sources, mostly instrumental. Magenta RealTime was trained using Tensor Processing Unit (TPU) hardware (TPUv6e / Trillium). Training was done using JAX and T5X, utilizing SeqIO for data pipelines. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. Model evaluation metrics and results will be shared in our forthcoming technical report.
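The token counts and bitrate quoted in the component list above follow from the 25Hz frame rate, the RVQ depth, and the 10 bit code size. The short sketch below reproduces that arithmetic; the function and variable names are hypothetical, and only the numeric constants come from this card.

```python
FRAME_RATE_HZ = 25   # SpectroStream frames per second (from the card)
BITS_PER_CODE = 10   # each RVQ code is a 10 bit index (from the card)


def tokens(seconds: float, rvq_depth: int) -> int:
    """Number of discrete tokens covering `seconds` of audio at a given RVQ depth."""
    return round(seconds * FRAME_RATE_HZ * rvq_depth)


def bitrate_kbps(rvq_depth: int) -> float:
    """Codec bitrate in kilobits per second at a given RVQ depth."""
    return FRAME_RATE_HZ * rvq_depth * BITS_PER_CODE / 1000


codec_kbps = bitrate_kbps(64)      # full-depth codec: 25 * 64 * 10 bits per second
context_tokens = tokens(10, 4)     # 10 s of context at RVQ depth 4
generated_tokens = tokens(2, 16)   # 2 s of output at RVQ depth 16
style_tokens = 6                   # quantized MusicCoCa style embedding (from the card)
encoder_tokens = context_tokens + style_tokens

print(codec_kbps, context_tokens, generated_tokens, encoder_tokens)
```

Under these figures the encoder sees 1000 context tokens plus 6 style tokens, and each decoding step emits 800 tokens for a two-second chunk, which matches the latency note above.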
t5gemma-9b-2b-prefixlm
ncsnpp-celebahq-256
t5gemma-s-s-prefixlm-it
datagemma-rag-27b-it
gemma-3-12b-pt-qat-q4_0-gguf
matcha-plotqa-v1
t5-efficient-mini
T5-Efficient-MINI is a variation of Google's original T5 following the T5 model architecture. It is a pretrained-only checkpoint and was released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. In a nutshell, the paper indicates that a Deep-Narrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count. To quote the paper:

> We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.

To be more precise, model depth is defined as the number of transformer blocks that are stacked sequentially. A sequence of word embeddings is therefore processed sequentially by each transformer block.

This model checkpoint - t5-efficient-mini - is of model type Mini with no variations. It has 31.23 million parameters and thus requires about 124.92 MB of memory in full precision (fp32) or 62.46 MB of memory in half precision (fp16 or bf16).

A summary of the original T5 model architectures can be seen here:

| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M |
| Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M |
| Small | 6/6 | 2048 | 512 | 32 | 8 | 60M |
| Base | 12/12 | 3072 | 768 | 64 | 12 | 220M |
| Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M |
| Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B |
| XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B |

| Abbreviation | Definition |
| ---- | ---- |
| nl | Number of transformer blocks (depth) |
| dm | Dimension of embedding vector (output vector of transformers block) |
| kv | Dimension of key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key-values projection matrices are tied |

If a model checkpoint has no specific el or dl, then both the number of encoder layers and the number of decoder layers correspond to nl.

The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.

Note: This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks. You can follow one of the following examples on how to fine-tune the model:

- Summarization
- Question Answering
- Text Classification - Note: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.

We strongly recommend reading the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers carefully to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers, as they are probably of limited practical usage and lack a more detailed description. Those checkpoints are kept here as they might be ported in the future.
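The memory figures quoted above (124.92 MB in fp32, 62.46 MB in fp16 or bf16) are consistent with simply multiplying the parameter count by the bytes per parameter. The sketch below reproduces that arithmetic, assuming 1 MB = 10^6 bytes and counting weights only (no activations, gradients, or optimizer state); the function name is hypothetical.

```python
def weight_memory_mb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in MB (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


NUM_PARAMS = 31.23e6  # parameter count stated in the card

fp32_mb = weight_memory_mb(NUM_PARAMS, 4)  # 4 bytes per fp32 parameter
half_mb = weight_memory_mb(NUM_PARAMS, 2)  # 2 bytes per fp16 / bf16 parameter

print(f"fp32: {fp32_mb:.2f} MB, fp16/bf16: {half_mb:.2f} MB")
```

Fine-tuning needs noticeably more memory than this, since gradients and optimizer state are stored alongside the weights.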
t5-efficient-base
switch-base-32
pix2struct-widget-captioning-base
byt5-xl
paligemma2-10b-pt-448
tapas-small-finetuned-wtq
xtr-base-en
metricx-23-qe-large-v2p0
pegasus-x-large
switch-base-64
t5-efficient-tiny-nl2
ul2
UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

The abstract from the paper reads: Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.

For more information, please take a look at the original paper. Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler

The checkpoint was iteratively pre-trained on C4 and fine-tuned on a variety of datasets.

The model is pretrained on the C4 corpus. For pretraining, the model is trained on a total of 1 trillion tokens on C4 (2 million steps) with a batch size of 1024. The sequence length is set to 512/512 for inputs and targets. Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens. The model has 32 encoder layers and 32 decoder layers, `dmodel` of 4096 and `df` of 16384. The dimension of each head is 256 for a total of 16 heads. Our model uses a model parallelism of 8. The same sentencepiece tokenizer as T5, with a vocabulary size of 32000, is used. UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. UL-20B was trained using the Jax and T5X infrastructure.

The training objective during pretraining is a mixture of different denoising strategies, which are explained in the following. To quote the paper:

> We conjecture that a strong universal model has to be exposed to solving diverse set of problems during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity should be injected to the objective of the model, otherwise the model might suffer from lack a certain ability, like long-coherent text generation. Motivated by this, as well as current class of objective functions, we define three main paradigms that are used during pre-training:

- R-Denoiser: The regular denoising is the standard span corruption introduced in T5 that uses a range of 2 to 5 tokens as the span length, which masks about 15% of input tokens. These spans are short and potentially useful to acquire knowledge instead of learning to generate fluent text.
- S-Denoiser: A specific case of denoising where we observe a strict sequential order when framing the inputs-to-targets task, i.e., prefix language modeling. To do so, we simply partition the input sequence into two sub-sequences of tokens as context and target such that the targets do not rely on future information. This is unlike standard span corruption where there could be a target token with earlier position than a context token. Note that similar to the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field. We note that S-Denoising with very short memory or no memory is in similar spirit to standard causal language modeling.
- X-Denoiser: An extreme version of denoising where the model must recover a large part of the input, given a small to moderate part of it. This simulates a situation where a model needs to generate long target from a memory with relatively limited information. To do so, we opt to include examples with aggressive denoising where approximately 50% of the input sequence is masked. This is done by increasing the span length and/or corruption rate. We consider a pre-training task to be extreme if it has a long span (e.g., ≥ 12 tokens) or has a large corruption rate (e.g., ≥ 30%). X-denoising is motivated by being an interpolation between regular span corruption and language model like objectives.

See the following diagram for a more visual explanation:

Important: For more details, please see section 3.1.2 of the paper.
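The distinction between regular (R) and extreme (X) span corruption described above can be expressed as a small rule. The sketch below uses the example thresholds quoted from the paper (span length ≥ 12 tokens or corruption rate ≥ 30%); the class, function names, and sample configurations are hypothetical and are not the settings used to train the checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanCorruption:
    """A span-corruption setting (illustrative, not the paper's exact configs)."""
    mean_span_length: float   # average length of a masked span, in tokens
    corruption_rate: float    # fraction of input tokens that are masked


def is_extreme(cfg: SpanCorruption,
               span_threshold: float = 12,
               rate_threshold: float = 0.30) -> bool:
    """Apply the 'long span OR large corruption rate' rule quoted above."""
    return (cfg.mean_span_length >= span_threshold
            or cfg.corruption_rate >= rate_threshold)


def denoiser_kind(cfg: SpanCorruption) -> str:
    """Label a span-corruption setting as 'X' (extreme) or 'R' (regular)."""
    return "X" if is_extreme(cfg) else "R"


# Short spans, ~15% masked: regular denoising in the T5 style.
print(denoiser_kind(SpanCorruption(mean_span_length=3, corruption_rate=0.15)))
# Short spans but half of the input masked: extreme.
print(denoiser_kind(SpanCorruption(mean_span_length=3, corruption_rate=0.50)))
# Long spans at a modest rate: also extreme.
print(denoiser_kind(SpanCorruption(mean_span_length=32, corruption_rate=0.15)))
```

The S-Denoiser is not covered by this rule because it is prefix language modeling rather than span corruption.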
The model was continuously fine-tuned after N pretraining steps, where N is typically from 50k to 100k. In other words, after every N steps of pretraining, the model is fine-tuned on each downstream task. See section 5.2.2 of the paper for an overview of all datasets that were used for fine-tuning. As the model is continuously fine-tuned, fine-tuning is stopped on a task once it has reached state-of-the-art to save compute. In total, the model was trained for 2.65 million steps. Important: For more details, please see sections 5.2.1 and 5.2.2 of the paper.

The different denoising strategies can be used to predict masked passages. Given the size of the model, inference needs to be run on at least a 40GB A100 GPU.

- For S-Denoising, prompt the text with the prefix `[S2S]`.
- For R-Denoising, prompt the text with the prefix `[NLU]`.
- For X-Denoising, prompt the text with the prefix `[NLG]`.
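The mode prefixes listed above can be attached to a prompt with a small helper. This is a minimal sketch: the names `MODE_PREFIX` and `with_mode_prefix` are hypothetical, and the single space between the prefix and the text is an assumption rather than something this card specifies.

```python
# Mapping from denoising mode to the prompt prefix named in the card.
MODE_PREFIX = {
    "S": "[S2S]",  # S-Denoising (prefix language modeling)
    "R": "[NLU]",  # R-Denoising (regular span corruption)
    "X": "[NLG]",  # X-Denoising (extreme span corruption)
}


def with_mode_prefix(text: str, mode: str) -> str:
    """Prepend the prefix for `mode` ('S', 'R' or 'X') to `text`."""
    try:
        prefix = MODE_PREFIX[mode.upper()]
    except KeyError:
        raise ValueError(f"unknown denoising mode: {mode!r}") from None
    # Assumption: prefix and text are separated by one space.
    return f"{prefix} {text}"


print(with_mode_prefix("The quick brown fox", "s"))
```

The resulting string would then be tokenized and passed to the model as usual; loading the 20B checkpoint itself is outside the scope of this sketch.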
pix2struct-ocrvqa-base
t5gemma-l-l-ul2
paligemma2-28b-mix-448
paligemma-3b-ft-ocrvqa-448
pix2struct-screen2words-base
gemma-3-27b-pt-qat-q4_0-gguf
ddpm-ema-cat-256
DiarizationLM-8b-Fisher-v1
switch-base-128
ddpm-ema-bedroom-256
gemma-2-2b-it-GGUF
gemma-7b-GGUF
timesfm-2.0-500m-jax
t5gemma-b-b-ul2-it
matcha-chart2text-pew
madlad400-7b-mt-bt
gemma-2b-it-GGUF
codegemma-2b-GGUF
pegasus-billsum
paligemma-3b-ft-rsvqa-hr-448
DiarizationLM-13b-Fisher-v1
matcha-chart2text-statista
codegemma-1.1-7b-it
tapas-large-finetuned-sqa
bert2bert_L-24_wmt_en_de
bert_uncased_L-2_H-768_A-12
t5-efficient-tiny-nh8
switch-c-2048
path-foundation
gemma-2b-GGUF
ddpm-bedroom-256
t5gemma-9b-9b-prefixlm
gemma-2b-AWQ
shieldgemma-27b
bert_uncased_L-6_H-512_A-8
madlad400-8b-lm
gemma-2b-aps-it
matcha-base
gemma-7b-it-GGUF
metricx-23-qe-xxl-v2p0
paligemma-3b-pt-896
gemma-3-1b-pt-qat-q4_0-gguf
bert_uncased_L-10_H-128_A-2
gemma-2b-pytorch
paligemma-3b-ft-docvqa-896
paligemma2-10b-pt-896
gemma-3-4b-pt-qat-q4_0-gguf
paligemma-3b-ft-vqav2-448
t5gemma-ml-ml-ul2
pix2struct-screen2words-large
tapas-large-finetuned-wikisql-supervised
ncsnpp-ffhq-256
t5gemma-ml-ml-prefixlm-it
tapas-small
gemma-2-2b-GGUF
byt5-xxl
t5gemma-xl-xl-prefixlm
paligemma-3b-ft-nlvr2-224
electra-large-generator
t5gemma-l-l-prefixlm-it
bert_uncased_L-8_H-128_A-2
paligemma-3b-ft-science-qa-224-jax
gemma-2b-it-keras
cxr-foundation
gemma-1.1-2b-it-GGUF
t5_11b_trueteacher_and_anli
pix2struct-ocrvqa-large
paligemma-3b-ft-vqav2-224
gemma-7b-it-pytorch
hear
gemma-7b-pytorch
codegemma-1.1-2b-keras
gemma-2-instruct-9b-keras
t5-large-ssm-nq
gemma-1.1-2b-it-keras
gemma-1.1-2b-it-pytorch
codegemma-7b-GGUF
gemma-1.1-7b-it-pytorch
gemma-7b-AWQ
paligemma-3b-ft-coco35l-448
gemma-2-9b-keras
paligemma-3b-ft-ocrvqa-224
gemma-2b-keras
t5gemma-xl-xl-ul2-it
seahorse-large-q5
gemma-1.1-7b-it-keras
t5gemma-9b-2b-prefixlm-it
seahorse-large-q4
paligemma-3b-ft-ocrvqa-896
gemma-7b-keras
multiberts-seed_0-step_1900k
paligemma-3b-ft-infovqa-896
ncsnpp-ffhq-1024
tapas-mini-finetuned-wtq
codegemma-2b-keras
codegemma-7b-it-keras
tapas-large-finetuned-tabfact
gemma-7b-it-keras
switch-large-128
codegemma-7b-keras
bert_uncased_L-4_H-768_A-12
codegemma-7b-it-GGUF
pegasus-reddit_tifu
t5-efficient-xl
multiberts-seed_0
multiberts-seed_1
roberta2roberta_L-24_cnn_daily_mail
t5-efficient-tiny-dl2
tapas-tiny
realm-cc-news-pretrained-embedder
pix2struct-ai2d-large
t5gemma-l-l-prefixlm
t5-small-ssm
matcha-plotqa-v2
tapas-medium-finetuned-wtq
switch-xxl-128
metricx-23-xxl-v2p0
bert_uncased_L-10_H-768_A-12
tapas-medium-finetuned-wikisql-supervised
pegasus-wikihow
pegasus-newsroom
tapas-small-finetuned-wikisql-supervised
switch-base-256
paligemma-3b-ft-textcaps-448
roberta2roberta_L-24_wikisplit
realm-orqa-nq-openqa
t5-efficient-large-nl32
gemma-2b-it-pytorch
paligemma2-28b-pt-448
paligemma-3b-ft-vizwizvqa-448
bert_uncased_L-8_H-768_A-12
paligemma2-3b-pt-448-keras
t5gemma-9b-2b-ul2
tapas-base-masklm
pix2struct-widget-captioning-large
Model card for Pix2Struct - Finetuned on Widget Captioning (Captioning a UI component on a screen) - large version. Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper. The abstract of the paper states that:

> Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
To convert a checkpoint, you can use the `convert_pix2struct_checkpoint_to_pytorch.py` script. Once saved, you can push your converted model to the Hub. The instructions for running the model are exactly the same as those given for the `pix2struct-textcaps-base` model. This model was originally contributed by Kenton Lee, Mandar Joshi et al. and added to the Hugging Face ecosystem by Younes Belkada. If you want to cite this work, please consider citing the original paper.