google

✓ Verified · Enterprise

Google's AI research division, creators of Gemini and PaLM

500 models • 85 total models in database

electra-base-discriminator

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.

For a detailed description and experimental results, please refer to our paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
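The real-versus-fake objective described above can be illustrated with a toy sketch. It only shows how per-token discriminator targets could be derived from an original and a corrupted sequence. The function name and the example sentence are illustrative assumptions, and no generator or discriminator network is modelled here.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Return 1 for each position whose token differs from the original, else 0.

    A position where the replacement happens to equal the original token
    is labelled 0 ("real"), because only actual differences are counted.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Hypothetical example: one token was swapped by a (not shown) generator.
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = replaced_token_labels(original, corrupted)
```

A discriminator would be trained to predict `labels` from `corrupted` alone, so every input position contributes a training signal.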

74,529,428
67

vit-base-patch16-224

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The original model card demonstrates classifying an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes; for code examples, refer to the documentation.

The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of image preprocessing during training and validation are linked from the original model card. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.

For evaluation results on several image classification benchmarks, see tables 2 and 5 of the original paper. For fine-tuning, the best results are obtained at a higher resolution (384x384), and larger model variants generally perform better.
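The patch and normalization arithmetic described in this card can be checked with a small sketch. The helper names are hypothetical, and the code assumes 8-bit pixel values that are first rescaled to [0, 1] before the mean/std normalization. It does not load the model.

```python
def vit_sequence_length(image_size=224, patch_size=16):
    """Number of tokens fed to the encoder: one per patch plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def normalize_pixel(value, mean=0.5, std=0.5):
    """Rescale an 8-bit pixel value to [0, 1], then normalize with mean and std."""
    return (value / 255.0 - mean) / std


tokens_224 = vit_sequence_length(224, 16)   # 14 x 14 patches + [CLS]
tokens_384 = vit_sequence_length(384, 16)   # 24 x 24 patches + [CLS]
darkest, brightest = normalize_pixel(0), normalize_pixel(255)
```

With mean 0.5 and standard deviation 0.5, normalized pixel values fall in the range [-1, 1].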

3,028,249
890

gemma-3-1b-it

2,584,047
704

siglip-so400m-patch14-384

--- license: apache-2.0 tags: - vision widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ---

license:apache-2.0
2,485,318
611

vit-base-patch16-224-in21k

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes. The exact details of image preprocessing during training and validation are linked from the original model card. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and a learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224.

For evaluation results on several image classification benchmarks, see tables 2 and 5 of the original paper. For fine-tuning, the best results are obtained at a higher resolution (384x384), and larger model variants generally perform better.
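Placing a linear layer on top of the [CLS] token, as this card describes, amounts to one matrix-vector product followed by a softmax. The sketch below uses a made-up 4-dimensional hidden state and hand-picked weights purely for illustration. A real head would match the encoder's hidden size and be trained on labeled images.

```python
import math


def linear_head(cls_state, weights, bias):
    """Compute one logit per class: weights @ cls_state + bias."""
    return [
        sum(w * x for w, x in zip(row, cls_state)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical last hidden state of the [CLS] token (toy size, not the real width).
cls_state = [0.2, -1.0, 0.5, 0.3]
# Hypothetical weights for a 3-class head.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear_head(cls_state, weights, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
```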

license:apache-2.0
2,199,148
378

gemma-3-12b-it

--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-12b-pt ---

1,471,709
563

flan-t5-base

--- language: - en - fr - ro - de - multilingual

license:apache-2.0
1,158,149
1,016

siglip-base-patch16-224

--- license: apache-2.0 tags: - vision widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ---

license:apache-2.0
1,085,165
73

gemma-3-4b-it

--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-4b-pt ---

934,887
950

paligemma2-3b-pt-224

--- library_name: transformers license: gemma pipeline_tag: image-text-to-text extra_gated_heading: Access PaliGemma on Hugging Face extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---

923,919
160

gemma-3-27b-it

--- license: gemma library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/gemma-3-27b-pt ---

837,287
1,678

fnet-base

--- language: en tags: - fnet license: apache-2.0 datasets: - c4 ---

license:apache-2.0
813,555
18

siglip2-base-patch16-naflex

--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---

license:apache-2.0
688,082
16

owlv2-base-patch16-ensemble

--- license: apache-2.0 tags: - vision - zero-shot-object-detection inference: false ---

license:apache-2.0
638,421
113

t5-v1_1-xxl

--- language: en datasets: - c4

license:apache-2.0
587,465
139

flan-t5-large

--- language: - en - fr - ro - de - multilingual

license:apache-2.0
564,953
848

flan-t5-small

--- language: - en - fr - ro - de - multilingual

license:apache-2.0
485,039
444

embeddinggemma-300m

--- license: gemma pipeline_tag: sentence-similarity library_name: sentence-transformers tags: - sentence-transformers - sentence-similarity - feature-extraction - text-embeddings-inference extra_gated_heading: Access EmbeddingGemma on Hugging Face extra_gated_prompt: To access EmbeddingGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_bu

480,107
1,160

medgemma-4b-it

--- license: other license_name: health-ai-developer-foundations license_link: https://developers.google.com/health-ai-developer-foundations/terms library_name: transformers pipeline_tag: image-text-to-text extra_gated_heading: Access MedGemma on Hugging Face extra_gated_prompt: >- To access MedGemma on Hugging Face, you're required to review and agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms). To do this, please ensur

472,846
748

siglip2-so400m-patch14-384

--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---

license:apache-2.0
467,749
62

mt5-small

--- language: - multilingual - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fil - fr - fy - ga - gd - gl - gu - ha - haw - hi - hmn - ht - hu - hy - ig - is - it - iw - ja - jv - ka - kk - km - kn - ko - ku - ky - la - lb - lo - lt - lv - mg - mi - mk - ml - mn - mr - ms - mt - my - ne - nl - no - ny - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - sm - sn - so - sq - sr - st - su - sv - sw - ta - te - tg - th - tr - uk - und -

license:apache-2.0
462,451
173

t5gemma-s-s-prefixlm

--- license: gemma library_name: transformers pipeline_tag: text2text-generation extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license base_model: google/t5gemma-s-s-prefixlm ---

458,697
2

siglip2-so400m-patch16-naflex

--- license: apache-2.0 tags: - vision widget: - src: >- https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg candidate_labels: bee in the sky, bee on the flower example_title: Bee library_name: transformers pipeline_tag: zero-shot-image-classification ---

license:apache-2.0
455,315
45

madlad400-3b-mt

--- license: apache-2.0 language: - multilingual - en - ru - es - fr - de - it - pt - pl - nl - vi - tr - sv - id - ro - cs - zh - hu - ja - th - fi - fa - uk - da - el - "no" - bg - sk - ko - ar - lt - ca - sl - he - et - lv - hi - sq - ms - az - sr - ta - hr - kk - is - ml - mr - te - af - gl - fil - be - mk - eu - bn - ka - mn - bs - uz - ur - sw - yue - ne - kn - kaa - gu - si - cy - eo - la - hy - ky - tg - ga - mt - my - km - tt - so - ku - ps - pa - rw - lo - ha - dv - fy - lb - ckb - mg

license:apache-2.0
385,653
159

mobilebert-uncased

--- language: en thumbnail: https://huggingface.co/front/thumbnails/google.png

license:apache-2.0
336,208
64

gemma-2-2b-it

--- license: gemma library_name: transformers pipeline_tag: text-generation extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: >- To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license tags: - conversational base_model: google/gemma-2-2b ---

302,483
1,216

gemma-2b

--- library_name: transformers new_version: google/gemma-2-2b license: gemma extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---

228,553
1,100

vit-hybrid-base-bit-384

--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k ---

license:apache-2.0
209,487
6

paligemma-3b-mix-224

--- library_name: transformers license: gemma pipeline_tag: image-text-to-text extra_gated_heading: Access PaliGemma on Hugging Face extra_gated_prompt: To access PaliGemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged-in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---

194,835
87

flan-t5-xl

--- language: - en - fr - ro - de - multilingual

license:apache-2.0
183,041
522

owlv2-large-patch14-ensemble

The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. 
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
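The open-vocabulary classification step described above, where class-name embeddings from the text model take the place of fixed classifier weights, can be sketched with toy vectors. Using cosine similarity as the score and 2-dimensional embeddings are assumptions made for illustration. The box head and the bipartite matching loss are not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify_tokens(token_embeddings, query_embeddings):
    """For each image token, return (best query index, similarity score)."""
    results = []
    for token in token_embeddings:
        scores = [cosine(token, query) for query in query_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results


# Hypothetical text-query embeddings, e.g. for "a cat" and "a dog".
queries = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical per-token outputs of the vision transformer.
tokens = [[0.9, 0.1], [0.2, 0.8]]
matches = classify_tokens(tokens, queries)
```

Because the "classifier weights" are just text embeddings, changing the set of detectable classes only requires encoding new text queries.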

license:apache-2.0
175,644
36

siglip2-base-patch16-224

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You...

license:apache-2.0
175,331
75

t5-v1_1-base

license:apache-2.0
165,267
59

ddpm-cifar10-32

license:apache-2.0
160,374
76

gemma-3-270m-it

--- base_model: google/gemma-3-270m license: gemma tags: - gemma3 - gemma - google pipeline_tag: text-generation library_name: transformers extra_gated_heading: Access Gemma on Hugging Face extra_gated_prompt: >- To access Gemma on Hugging Face, you’re required to review and agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license ---

157,382
462

owlvit-base-patch32

The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

license:apache-2.0
151,197
143

flan-t5-xxl

0. TL;DR 1. Model Details 2. Usage 3. Uses 4. Bias, Risks, and Limitations 5. Training Details 6. Evaluation 7. Environmental Impact 8. Citation

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks, also covering more languages. As mentioned in the first few lines of the abstract:

> Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Disclaimer: Content from this model card has been written by the Hugging Face team, and parts of it were copied from the T5 model card.

- Model type: Language model
- Language(s) (NLP): English, German, French
- License: Apache 2.0
- Related Models: All FLAN-T5 Checkpoints
- Original Checkpoints: All Original FLAN-T5 Checkpoints
- Resources for more information:
  - Research paper
  - GitHub Repo
  - Hugging Face FLAN-T5 Docs (Similar to T5)

The original model card provides example scripts for using the model in `transformers`, including running the model on a GPU using different precisions.

The authors write in the original paper's model card that:

> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models

The information below in this section is copied from the model's official model card:

> Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

> Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

> Flan-T5 has not been tested in real world applications.

> Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.

The model was trained on a mixture of tasks, including the tasks shown in figure 2 of the original paper. According to the model card from the original paper:

> These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.

The model has been trained on TPU v3 or TPU v4 pods, using the `t5x` codebase together with `jax`.

The authors evaluated the model on various tasks covering several languages (1836 in total). For full results for FLAN-T5-XXL, see the research paper, Table 3.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
- Hours used: More information needed
- Cloud Provider: GCP
- Compute Region: More information needed
- Carbon Emitted: More information needed
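Since the card mentions running the model at different precisions, a rough back-of-the-envelope sketch of weight memory per precision may help. The parameter count of 11 billion is an assumption not stated in this card, and the estimate covers weights only (no activations, caches, or framework overhead).

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

# Assumed parameter count for the XXL checkpoint (not stated in this card).
ASSUMED_PARAMS = 11_000_000_000


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


estimates = {
    dtype: weight_memory_gib(ASSUMED_PARAMS, dtype) for dtype in BYTES_PER_PARAM
}
```

Under this assumption, halving the precision halves the weight footprint, which is the main reason reduced-precision loading is offered for large checkpoints.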

license:apache-2.0
147,745
1,266

gemma-2-2b

145,298
602

vit-base-patch16-384

license:apache-2.0
143,665
46

siglip2-giant-opt-patch16-384

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks). The original model card shows how to perform zero-shot image classification and how to encode an image using the vision tower; for more code examples, refer to the SigLIP documentation.

SigLIP 2 adds the following training objectives on top of SigLIP:

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023). Evaluation results for SigLIP 2 are reported in the paper.
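Zero-shot image classification with a SigLIP-style model scores each candidate label independently with a sigmoid over an image-text similarity. The sketch below shows only that scoring step on toy embeddings. The `scale` and `bias` values are hypothetical placeholders (in a real checkpoint they are learned), and no actual encoder is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_shot_scores(image_embedding, text_embeddings, scale=10.0, bias=-5.0):
    """Independent match probability for each candidate label.

    scale and bias are illustrative placeholders, not values from any checkpoint.
    """
    return [
        sigmoid(scale * cosine(image_embedding, text) + bias)
        for text in text_embeddings
    ]


# Hypothetical embeddings: the image matches the first label, not the second.
image = [1.0, 0.0]
texts = [[1.0, 0.0], [0.0, 1.0]]
scores = zero_shot_scores(image, texts)
```

Unlike a softmax over labels, these scores need not sum to 1, so several labels (or none) can receive a high probability for the same image.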

license:apache-2.0
138,462
28

electra-small-discriminator

license:apache-2.0
119,147
36

gemma-3-270m

118,659
903

mt5-large

license:apache-2.0
117,643
100

bert_uncased_L-2_H-128_A-2

license:apache-2.0
113,135
33

gemma-3n-E2B-it

111,591
234

pegasus-xsum

--- language: en tags: - summarization model-index: - name: google/pegasus-xsum results: - task: type: summarization name: Summarization dataset: name: samsum type: samsum config: samsum split: train metrics: - name: ROUGE-1 type: rouge value: 21.8096 verified: true - name: ROUGE-2 type: rouge value: 4.2525 verified: true - name: ROUGE-L type: rouge value: 17.4469 verified: true - name: ROUGE-LSUM type: rouge value: 18.8907 verified: true - name: loss type: loss value: 3.0317161083221436 verifie

108,703
212

gemma-2-9b-it

108,344
742

gemma-3-12b-pt

100,167
76

siglip2-so400m-patch16-256

license:apache-2.0
98,053
1

metricx-24-hybrid-xxl-v2p6-bfloat16

This is not an officially supported Google product. > ℹ️ For the full-precision (float32) variant of this model, see MetricX-24 (XXL). GitHub repository: https://github.com/google-research/metricx The repository contains the code for running inference on MetricX-24 models, a family of models for automatic evaluation of translations that were proposed in the WMT'24 Metrics Shared Task submission MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. The models were trained in T5X and then converted for use in PyTorch. There are 3 MetricX-24 models available on Hugging Face that vary in the number of parameters. Unlike the MetricX-23 models, the MetricX-24 models are all hybrid models that can do both reference-based and reference-free (also known as quality estimation, or QE) inference: MetricX-24-Hybrid-XXL MetricX-24-Hybrid-XL MetricX-24-Hybrid-Large We recommend using the XXL model versions for the best agreement with human judgments of translation quality, the Large versions for best speed, and the XL for an intermediate use case. The MetricX-24 models available here are most similar to the primary submission to the WMT'24 Metrics Shared Task. They are initialized with mT5, then fine-tuned on a combination of direct assessment and MQM data from WMT'15-'22. However, we made a couple of small changes that make these models different from the WMT'24 submissions. First, the metric scores get automatically clipped at 0 and 25, to ensure they are strictly in the [0, 25] range, as due to the nature of regression models, the scores could otherwise sometimes fall outside the range. Second, we included one additional type of synthetic training examples that weren't ready in time for the official submission. These are examples of perfect translations of multi-sentence segments, generated from the MQM data from WMT'20-'22. 
The purpose of this category of synthetic data is to reduce the model's bias against longer translations when the source segment and/or reference are also long.

For comparison with the submissions to the WMT'24 Metrics Shared Task, we provide an overview of the system- and segment-level correlation scores between the MetricX-24 scores and MQM ratings of translation quality, as calculated on the shared task's test sets:

| Model | Sys-Level SPA (en-de) | Seg-Level Acc (en-de) | Sys-Level SPA (en-es) | Seg-Level Acc (en-es) | Sys-Level SPA (ja-zh) | Seg-Level Acc (ja-zh) |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| MetricX-24-Hybrid-XXL | 0.865 | 0.543 | 0.785 | 0.685 | 0.878 | 0.541 |
| MetricX-24-Hybrid-XL | 0.884 | 0.522 | 0.806 | 0.683 | 0.859 | 0.528 |
| MetricX-24-Hybrid-Large | 0.879 | 0.511 | 0.795 | 0.686 | 0.845 | 0.514 |
| MetricX-24-Hybrid-QE-XXL | 0.884 | 0.525 | 0.789 | 0.685 | 0.863 | 0.527 |
| MetricX-24-Hybrid-QE-XL | 0.879 | 0.502 | 0.774 | 0.683 | 0.849 | 0.509 |
| MetricX-24-Hybrid-QE-Large | 0.809 | 0.490 | 0.762 | 0.684 | 0.847 | 0.508 |

Below are the above correlation scores averaged, as used in the shared task to determine the final ranking of the submissions:

| Model | Average Correlation |
| -------------------------- | ----- |
| MetricX-24-Hybrid-XXL | 0.716 |
| MetricX-24-Hybrid-XL | 0.714 |
| MetricX-24-Hybrid-Large | 0.705 |
| MetricX-24-Hybrid-QE-XXL | 0.712 |
| MetricX-24-Hybrid-QE-XL | 0.699 |
| MetricX-24-Hybrid-QE-Large | 0.683 |

NOTE: Since MetricX-24 models are hybrid models, the MetricX-24-Hybrid and MetricX-24-Hybrid-QE rows for a given size correspond to the same model, evaluated with and without the references, respectively.

If you use MetricX-24 in your research, please cite the MetricX-24 publication named above.
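The score clipping and the averaging behind the second table can be reproduced with a short sketch. The correlation values are copied from the first table, while the function names are hypothetical and not part of the MetricX codebase.

```python
# Per-model correlations in table order:
# SPA/Acc for en-de, SPA/Acc for en-es, SPA/Acc for ja-zh.
CORRELATIONS = {
    "MetricX-24-Hybrid-XXL": [0.865, 0.543, 0.785, 0.685, 0.878, 0.541],
    "MetricX-24-Hybrid-XL": [0.884, 0.522, 0.806, 0.683, 0.859, 0.528],
    "MetricX-24-Hybrid-Large": [0.879, 0.511, 0.795, 0.686, 0.845, 0.514],
    "MetricX-24-Hybrid-QE-XXL": [0.884, 0.525, 0.789, 0.685, 0.863, 0.527],
    "MetricX-24-Hybrid-QE-XL": [0.879, 0.502, 0.774, 0.683, 0.849, 0.509],
    "MetricX-24-Hybrid-QE-Large": [0.809, 0.490, 0.762, 0.684, 0.847, 0.508],
}


def clip_score(raw_score, low=0.0, high=25.0):
    """Clip a raw regression output into the [0, 25] range."""
    return min(max(raw_score, low), high)


def average_correlation(values):
    """Mean of the six correlation scores, rounded to three decimals."""
    return round(sum(values) / len(values), 3)


averages = {name: average_correlation(v) for name, v in CORRELATIONS.items()}
```

Averaging the six columns of each row this way yields the values listed in the "Average Correlation" table.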

license:apache-2.0
96,749
1

siglip2-base-patch16-512

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks). The original model card shows how to perform zero-shot image classification and how to encode an image using the vision tower; for more code examples, refer to the SigLIP documentation.

SigLIP 2 adds the following training objectives on top of SigLIP:

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023). Evaluation results for SigLIP 2 are reported in the paper.

license:apache-2.0
94,194
30

gemma-3-12b-it-qat-q4_0-gguf

88,590
203

gemma-3-27b-it-qat-q4_0-gguf

83,005
358

mt5-base

Languages: multilingual, af, am, ar, az, be, bg, bn, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, fy, ga, gd, gl, gu, ha, haw, hi, hmn, ht, hu, hy, ig, is, it, iw, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tr, uk, und, ...

license:apache-2.0
80,681
253

videoprism-base-f16r288

license:apache-2.0
69,785
88

siglip-so400m-patch14-224

license:apache-2.0
69,069
54

gemma-2b-it

68,062
820

siglip2-base-patch16-256

license:apache-2.0
55,152
6

siglip2-large-patch16-256

license:apache-2.0
55,067
4

byt5-small

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4, excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data, e.g., `google/byt5-small` significantly outperforms mt5-small on TweetQA. Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models. Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. ByT5 works on raw UTF-8 bytes and can be used without a tokenizer. For batched inference and training, however, using a tokenizer class for padding is recommended. Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
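To make the "raw UTF-8 bytes, no tokenizer" idea concrete, here is a minimal sketch of the byte-to-id mapping. It assumes the convention used by the Hugging Face ByT5 checkpoints, where ids 0 to 2 are reserved for the pad, end-of-sequence and unknown tokens, so every byte value is shifted by 3. The helper names `encode` and `decode` are hypothetical.

```python
NUM_SPECIAL_TOKENS = 3  # assumed layout: 0 = pad, 1 = eos, 2 = unk


def encode(text):
    """Map text to model input ids: one id per UTF-8 byte, shifted by 3."""
    return [byte + NUM_SPECIAL_TOKENS for byte in text.encode("utf-8")]


def decode(ids):
    """Invert encode(), skipping special ids and anything above the byte range."""
    raw = bytes(
        i - NUM_SPECIAL_TOKENS
        for i in ids
        if NUM_SPECIAL_TOKENS <= i < 256 + NUM_SPECIAL_TOKENS
    )
    return raw.decode("utf-8", errors="ignore")


ids = encode("Life is like a box of chocolates.")
print(len(ids), ids[:4])
print(decode(ids))
```

Non-ASCII characters expand to several ids (for example, "é" becomes two), which is why byte sequences are longer than subword sequences.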

license:apache-2.0
53,105
83

gemma-3-4b-pt

52,073
119

vivit-b-16x2-kinetics400

ViViT model as introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository. Disclaimer: The team releasing ViViT did not write a model card for this model so this model card has been written by the Hugging Face team. ViViT is an extension of the Vision Transformer (ViT) to video. The model is mostly intended to be fine-tuned on a downstream task, like video classification. See the model hub to look for fine-tuned versions on a task that interests you.

license:mit
50,482
35

gemma-3-1b-pt

48,621
165

vit-large-patch16-224

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at the same resolution, 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/Flax are coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found in the original repository. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size generally improves performance as well.
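The patch and normalization figures above translate into simple arithmetic. The sketch below is only an illustration of that arithmetic (the function names are hypothetical, not part of any library): it counts the tokens the encoder sees for a square image and applies the mean 0.5 / standard deviation 0.5 normalization to a single pixel value.

```python
def num_tokens(image_size, patch_size=16):
    """Sequence length for a square image: one token per patch plus [CLS]."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def normalize_pixel(value_0_255, mean=0.5, std=0.5):
    """Rescale an 8-bit pixel to [0, 1], then normalize with the given mean/std."""
    return (value_0_255 / 255.0 - mean) / std


print(num_tokens(224))  # 14 x 14 patches + [CLS]
print(num_tokens(384))  # 24 x 24 patches + [CLS]
print(normalize_pixel(0), normalize_pixel(255))
```

With mean and standard deviation both 0.5, pixel values end up in the range [-1, 1]. Moving from 224x224 to 384x384 roughly triples the sequence length, which is part of why higher-resolution fine-tuning costs more.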

license:apache-2.0
48,519
40

mobilenet_v2_1.0_224

48,228
36

siglip2-so400m-patch16-512

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features. You...

license:apache-2.0
47,713
38

siglip2-so400m-patch16-384

license:apache-2.0
46,730
4

byt5-base

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4, excluding any supervised training, with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is usable on a downstream task. ByT5 works especially well on noisy text data, e.g., `google/byt5-base` significantly outperforms mt5-base on TweetQA. Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models. Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. ByT5 works on raw UTF-8 bytes and can be used without a tokenizer. For batched inference and training, however, using a tokenizer class for padding is recommended. Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
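The padding that a tokenizer class handles for batched input can be sketched by hand. This is an illustration under assumptions, not the library's implementation: it assumes pad id 0, end-of-sequence id 1 appended to each sequence, and a byte offset of 3, and the helper names are hypothetical.

```python
PAD_ID = 0       # assumed pad token id
EOS_ID = 1       # assumed end-of-sequence token id
BYTE_OFFSET = 3  # assumed shift that reserves ids 0-2 for special tokens


def encode(text):
    """One id per UTF-8 byte, followed by an end-of-sequence id."""
    return [byte + BYTE_OFFSET for byte in text.encode("utf-8")] + [EOS_ID]


def pad_batch(texts):
    """Right-pad every sequence to the longest one and build attention masks."""
    encoded = [encode(text) for text in texts]
    max_len = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in encoded]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded
    ]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(["hi", "hello"])
print(input_ids)
print(attention_mask)
```

The attention mask marks real positions with 1 and padding with 0, so the model can ignore the padded tail of shorter sequences.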

license:apache-2.0
46,178
27

siglip2-so400m-patch14-224

license:apache-2.0
45,395
1

t5-v1_1-xl

T5 Version 1.1 includes the following improvements compared to the original T5 model:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU.
- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.
- Pre-trained on C4 only, without mixing in the downstream tasks.
- No parameter sharing between the embedding and classifier layers.
- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different: larger `d_model` and smaller `num_heads` and `d_ff`.

Note: T5 Version 1.1 was only pre-trained on C4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task. Pretraining Dataset: C4. Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
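The GEGLU feed-forward layer mentioned in the list of improvements can be sketched in a few lines. This is an illustrative, dependency-free version under assumptions: it uses the exact erf-based GELU (implementations often use a tanh approximation instead), omits biases and dropout, and the weight matrices and function names are made up for the example.

```python
import math


def gelu(x):
    """Exact GELU; real implementations may use a tanh approximation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def geglu_ffn(x, w_gate, w_linear, w_out):
    """GEGLU feed-forward: (GELU(W_gate x) * (W_linear x)) projected by W_out."""
    gate = [gelu(h) for h in matvec(w_gate, x)]
    linear = matvec(w_linear, x)
    hidden = [g * l for g, l in zip(gate, linear)]
    return matvec(w_out, hidden)


# Toy shapes: model width 2, feed-forward width 3.
x = [1.0, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_linear = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

out = geglu_ffn(x, w_gate, w_linear, w_out)
print(out)
```

Compared with a plain ReLU feed-forward layer, the gated variant uses two input projections instead of one, with one branch modulating the other elementwise.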

license:apache-2.0
42,192
16

gemma-3n-E4B-it

41,999
815

long-t5-tglobal-base

LongT5 (transient-global attention, base-sized model) is a LongT5 model pre-trained on English-language text. The model was introduced in the paper LongT5: Efficient Text-To-Text Transformer for Long Sequences by Guo et al. and first released in the LongT5 repository. The full model architecture and configuration can be found in the Flaxformer repository, which builds on another Google Research project, T5x. Disclaimer: The team releasing LongT5 did not write a model card for this model so this model card has been written by the Hugging Face team. Model description: LongT5 is an encoder-decoder transformer pre-trained in a text-to-text denoising generative setting (Pegasus-like generation pre-training). It extends the T5 model with one of two efficient attention mechanisms: (1) local attention, or (2) transient-global attention. These sparse attention patterns allow the model to handle long input sequences efficiently. LongT5 is particularly effective when fine-tuned for text generation tasks (summarization, question answering) that require handling long input sequences (up to 16,384 tokens). The model is mostly meant to be fine-tuned on a supervised dataset. See the model hub to look for fine-tuned versions on a task that interests you.
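To give a feel for why the two sparse attention patterns help with long inputs, the sketch below models them in plain Python. It is a simplified illustration, not the model's implementation: the radius and block size are assumed example values, block summaries are plain sums (any normalization is omitted), and all function names are hypothetical.

```python
def local_neighbors(i, seq_len, radius):
    """Positions token i can attend to under local attention."""
    return list(range(max(0, i - radius), min(seq_len, i + radius + 1)))


def global_block_summaries(embeddings, block_size):
    """Sum the embeddings in each fixed-size block to form one global token each."""
    summaries = []
    for start in range(0, len(embeddings), block_size):
        block = embeddings[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(vec[d] for vec in block) for d in range(dim)])
    return summaries


def attended_keys_per_token(seq_len, radius, block_size):
    """Upper bound on keys per token: local window plus one key per block."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    return min(seq_len, 2 * radius + 1) + num_blocks


print(local_neighbors(5, 10, 2))
print(global_block_summaries([[1, 2], [3, 4], [5, 6]], 2))
# Assumed example values: radius 127, block size 16, 16,384 input tokens.
print(attended_keys_per_token(16384, 127, 16), "keys instead of", 16384)
```

Under these assumed values each token attends to roughly 1,300 keys rather than all 16,384, which is where the efficiency gain for long sequences comes from.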

license:apache-2.0
41,230
47

byt5-large

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. ByT5 was only pre-trained on mC4 excluding any supervised training with an average span-mask of 20 UTF-8 characters. Therefore, this model has to be fine-tuned before it is useable on a downstream task. ByT5 works especially well on noisy text data,e.g., `google/byt5-large` significantly outperforms mt5-large on TweetQA. Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel ByT5 works on raw UTF-8 bytes and can be used without a tokenizer: For batched inference & training it is however recommended using a tokenizer class for padding: Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. 
As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

license:apache-2.0
40,110
16

bigbird-roberta-base

license:apache-2.0
36,546
56

paligemma-3b-pt-224

36,513
371

owlv2-base-patch16

The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. 
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
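The open-vocabulary classification idea described above, where the fixed classification weights are replaced by class-name embeddings from the text model, can be sketched as a similarity lookup. This is a toy illustration with made-up two-dimensional embeddings and hypothetical function names; the real model also predicts a box per token and applies learned scaling to the similarities.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_tokens(token_embeddings, query_embeddings):
    """Assign each image token the text query with the highest cosine similarity."""
    names = list(query_embeddings)
    queries = [normalize(query_embeddings[name]) for name in names]
    results = []
    for token in token_embeddings:
        tok = normalize(token)
        sims = [sum(a * b for a, b in zip(tok, q)) for q in queries]
        best = max(range(len(names)), key=lambda i: sims[i])
        results.append((names[best], sims[best]))
    return results


# Toy stand-ins for per-token image embeddings and text-query embeddings.
token_embeddings = [[0.9, 0.1], [0.2, 0.8]]
query_embeddings = {"a cat": [1.0, 0.0], "a remote control": [0.0, 1.0]}

predictions = classify_tokens(token_embeddings, query_embeddings)
print(predictions)
```

Because the "classifier weights" are just text embeddings, new categories can be queried at inference time by embedding new class names, without retraining the detection heads.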

license:apache-2.0
33,944
28

gemma-3n-E2B-it-litert-lm

33,886
204

gemma-2-9b

30,922
677

owlvit-base-patch16

The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

license:apache-2.0
30,377
12

gemma-7b

30,012
3,229

tapas-base-finetuned-wtq

TAPAS base model fine-tuned on WikiTable Questions (WTQ). This model has 2 versions which can be used. The default version corresponds to the `tapas_wtq_wikisql_sqa_inter_masklm_base_reset` checkpoint of the original GitHub repository. This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned in a chain on SQA, WikiSQL and finally WTQ. It uses relative position embeddings (i.e. resetting the position index at every cell of the table). The other (non-default) version which can be used is `no_reset`, which corresponds to `tapas_wtq_wikisql_sqa_inter_masklm_base` (intermediate pre-training, absolute position embeddings). Disclaimer: The team releasing TAPAS did not write a model card for this model so this model card has been written by the Hugging Face team and contributors.

| Size | Reset | Dev Accuracy | Link |
| -------- | -------- | -------- | ---- |
| LARGE | noreset | 0.5062 | tapas-large-finetuned-wtq (with absolute pos embeddings) |
| LARGE | reset | 0.5097 | tapas-large-finetuned-wtq |
| BASE | noreset | 0.4525 | tapas-base-finetuned-wtq (with absolute pos embeddings) |
| BASE | reset | 0.4638 | tapas-base-finetuned-wtq |
| MEDIUM | noreset | 0.4324 | tapas-medium-finetuned-wtq (with absolute pos embeddings) |
| MEDIUM | reset | 0.4324 | tapas-medium-finetuned-wtq |
| SMALL | noreset | 0.3681 | tapas-small-finetuned-wtq (with absolute pos embeddings) |
| SMALL | reset | 0.3762 | tapas-small-finetuned-wtq |
| MINI | noreset | 0.2783 | tapas-mini-finetuned-wtq (with absolute pos embeddings) |
| MINI | reset | 0.2854 | tapas-mini-finetuned-wtq |
| TINY | noreset | 0.0823 | tapas-tiny-finetuned-wtq (with absolute pos embeddings) |
| TINY | reset | 0.1039 | tapas-tiny-finetuned-wtq |

TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:
- Masked language modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and associated text.
- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model by creating a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.

This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head and an aggregation head on top of the pre-trained model, and then jointly training these randomly initialized classification heads with the base model on SQA, WikiSQL and finally WTQ. You can use this model for answering questions related to a table. For code examples, we refer to the documentation of TAPAS on the Hugging Face website. The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: `[CLS] Question [SEP] Flattened table [SEP]`. The authors first converted the WTQ dataset into the format of SQA using automatic conversion scripts. The model was fine-tuned on 32 Cloud TPU v3 cores for 50,000 steps with maximum sequence length 512 and batch size of 512. In this setup, fine-tuning takes around 10 hours. The optimizer used is Adam with a learning rate of 1.93581e-5, and a warmup ratio of 0.128960. An inductive bias is added such that the model only selects cells of the same column. This is reflected by the `select_one_column` parameter of `TapasConfig`. See the paper for more details (tables 11 and 12).
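The difference between the default ("reset") and `noreset` variants comes down to how position indices are assigned to table tokens. The sketch below illustrates the idea only: it ignores the question tokens and special tokens, the example cells are made up, and `position_ids` is a hypothetical helper rather than a library function.

```python
def position_ids(cells, reset=True):
    """Assign a position index to every token of a flattened table.

    With reset=True the index restarts at 0 in every cell (relative positions);
    with reset=False it keeps counting across the whole sequence (absolute).
    """
    ids = []
    next_pos = 0
    for cell_tokens in cells:
        if reset:
            next_pos = 0
        for _ in cell_tokens:
            ids.append(next_pos)
            next_pos += 1
    return ids


# Each inner list holds the WordPiece tokens of one table cell (toy example).
cells = [["new", "york"], ["8", ".", "4", "m"], ["paris"]]

print(position_ids(cells, reset=True))
print(position_ids(cells, reset=False))
```

With the reset scheme, a token's position depends only on where it sits inside its cell, so the same cell content gets the same positions regardless of how large the table is.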

license:apache-2.0
29,971
232

paligemma2-28b-pt-896

29,884
49

siglip2-base-patch32-256

license:apache-2.0
29,773
9

owlvit-large-patch14

The OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries. OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. 
We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

license:apache-2.0
29,019
26

siglip2-large-patch16-384

license:apache-2.0
27,982
2

medgemma-27b-text-it

26,672
369

siglip-large-patch16-384

license:apache-2.0
26,394
9

siglip-large-patch16-256

license:apache-2.0
25,997
12

siglip-base-patch16-256

license:apache-2.0
24,862
6

paligemma2-3b-ft-docci-448

24,393
13

paligemma2-10b-pt-224

24,267
8

t5gemma-9b-9b-ul2

24,189
5

vit-large-patch16-224-in21k

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not provide any fine-tuned heads, as these were zero'd by Google researchers. However, the model does include the pre-trained pooler, which can be used for downstream tasks (such as image classification). By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model to embed images, but it's mostly intended to be fine-tuned on a downstream task. Currently, both the feature extractor and model support PyTorch. 
TensorFlow and JAX/Flax are coming soon, and the API of ViTFeatureExtractor might change. The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes. The exact details of preprocessing of images during training/validation can be found in the original repository. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224. For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size generally improves performance as well.
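Gradient clipping at global norm 1, as mentioned above, means rescaling all gradients together whenever their combined L2 norm exceeds 1. The following is a minimal, framework-free sketch of that operation on nested lists; the function name and the toy gradients are illustrative assumptions, and real training code would use the clipping utility of its deep learning framework.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradient tensors so their joint L2 norm is at most max_norm.

    grads is a list of flat lists, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(tensor) for tensor in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm


# Two toy parameter tensors whose joint gradient norm is 5.
grads = [[3.0], [4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Scaling every tensor by the same factor keeps the gradient direction unchanged and only shortens the step, unlike clipping each value independently.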

license:apache-2.0
24,165
30

efficientnet-b0

license:apache-2.0
21,685
22

siglip2-base-patch16-384

license:apache-2.0
20,425
8

vit-large-patch16-384

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who had already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. The model can be used, for example, to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes.
Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/FLAX are coming soon, and the API of ViTFeatureExtractor might change.
The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224.
For evaluation results on several image classification benchmarks, see Tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size also tends to improve performance.
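The patch description above implies a fixed token count per image, which changes between the pre-training and fine-tuning resolutions. The following is a small arithmetic sketch, assuming square images, non-overlapping patches, and a single [CLS] token; the helper name is hypothetical.

```python
def vit_sequence_length(image_size, patch_size, num_class_tokens=1):
    """Number of tokens the encoder sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus the [CLS] token prepended to the sequence.
    return patches_per_side * patches_per_side + num_class_tokens


pretrain_tokens = vit_sequence_length(224, 16)   # pre-training resolution
finetune_tokens = vit_sequence_length(384, 16)   # fine-tuning resolution
```

The sequence grows from 197 to 577 tokens when moving from 224x224 to 384x384 at a patch size of 16.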

license:apache-2.0
19,936
15

pegasus-large

19,442
104

t5gemma-b-b-ul2

18,935
2

siglip2-large-patch16-512

license:apache-2.0
18,670
12

videoprism-lvt-base-f16r288

license:apache-2.0
18,366
9

t5-efficient-tiny

T5-Efficient-TINY is a variation of Google's original T5 following the T5 model architecture. It is a pretrained-only checkpoint and was released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. In a nutshell, the paper indicates that a Deep-Narrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count.
> We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.
To be more precise, model depth is defined as the number of transformer blocks that are stacked sequentially. A sequence of word embeddings is therefore processed sequentially by each transformer block.
This model checkpoint - t5-efficient-tiny - is of model type Tiny with no variations. It has 15.58 million parameters and thus requires ca. 62.32 MB of memory in full precision (fp32) or 31.16 MB of memory in half precision (fp16 or bf16). A summary of the original T5 model architectures can be seen here:
| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M |
| Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M |
| Small | 6/6 | 2048 | 512 | 32 | 8 | 60M |
| Base | 12/12 | 3072 | 768 | 64 | 12 | 220M |
| Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M |
| Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B |
| XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B |
| Abbreviation | Definition |
| ---- | ---- |
| nl | Number of transformer blocks (depth) |
| dm | Dimension of embedding vector (output vector of transformer block) |
| kv | Dimension of key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key-values projection matrices are tied |
If a model checkpoint has no specific el or dl, then both the number of encoder and decoder layers correspond to nl.
The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
Note: This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks. You can follow one of the following examples on how to fine-tune the model:
- Summarization
- Question Answering
- Text Classification (Note: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.)
We strongly recommend that the reader go carefully through the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers as they are probably of limited practical usage and are lacking a more detailed description. Those checkpoints are kept here as they might potentially be ported in the future.
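The memory figures quoted for this checkpoint follow directly from the parameter count. A minimal sketch of that arithmetic, assuming 1 MB = 10^6 bytes and counting only the weights (no activations or optimizer state); the helper name is hypothetical.

```python
NUM_PARAMS = 15_580_000  # 15.58 million parameters, as stated in the card


def weights_memory_mb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


fp32_mb = weights_memory_mb(NUM_PARAMS, 4)  # full precision: 4 bytes per parameter
fp16_mb = weights_memory_mb(NUM_PARAMS, 2)  # fp16 or bf16: 2 bytes per parameter
```

This reproduces the 62.32 MB and 31.16 MB figures given in the description.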

license:apache-2.0
17,103
28

siglip-base-patch16-256-multilingual

license:apache-2.0
17,079
51

umt5-xxl

UMT5 is pretrained on an updated version of the mC4 corpus, covering 107 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
Note: UMT5 was only pre-trained on mC4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
Paper: UniMax, Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining
Authors: Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant
Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
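The capping idea in the abstract can be illustrated with a short sketch. It assumes a budget-allocation procedure in which languages are visited from the smallest corpus to the largest, each receiving an equal share of the remaining character budget unless that share would exceed a maximum number of passes over its corpus. The function name, toy corpus sizes, and budget are hypothetical, and this is not the authors' released code.

```python
def unimax_distribution(char_counts, budget, max_epochs):
    """Return per-language sampling probabilities under an epoch cap."""
    languages = sorted(char_counts, key=char_counts.get)  # smallest corpus first
    remaining = float(budget)
    allocation = {}
    for index, lang in enumerate(languages):
        # Equal split of what is left among the languages not yet handled.
        uniform_share = remaining / (len(languages) - index)
        # Never repeat a language's corpus more than max_epochs times.
        cap = char_counts[lang] * max_epochs
        allocation[lang] = min(uniform_share, cap)
        remaining -= allocation[lang]
    total = sum(allocation.values())
    return {lang: chars / total for lang, chars in allocation.items()}


toy_counts = {"small": 10, "medium": 100, "large": 1000}
probs = unimax_distribution(toy_counts, budget=300, max_epochs=4)
```

In this toy run the smallest language is capped at four passes over its corpus, and the budget it cannot use is shared equally by the two larger languages.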

license:apache-2.0
16,780
49

bert_uncased_L-4_H-256_A-4

This is the set of 24 BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (English only, uncased, trained with WordPiece masking). We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
You can download the 24 BERT miniatures either from the official BERT GitHub page, or via Hugging Face from the links below:
| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| L=2 |[2/128 (BERT-Tiny)][2128]|[2/256][2256]|[2/512][2512]|[2/768][2768]|
| L=4 |[4/128][4128]|[4/256 (BERT-Mini)][4256]|[4/512 (BERT-Small)][4512]|[4/768][4768]|
| L=6 |[6/128][6128]|[6/256][6256]|[6/512][6512]|[6/768][6768]|
| L=8 |[8/128][8128]|[8/256][8256]|[8/512 (BERT-Medium)][8512]|[8/768][8768]|
| L=10 |[10/128][10128]|[10/256][10256]|[10/512][10512]|[10/768][10768]|
| L=12 |[12/128][12128]|[12/256][12256]|[12/512][12512]|[12/768 (BERT-Base)][12768]|
Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.
Here are the corresponding GLUE scores on the test set:
|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|
For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
If you use these models, please cite the paper referenced above.
[2128]: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
[2256]: https://huggingface.co/google/bert_uncased_L-2_H-256_A-4
[2512]: https://huggingface.co/google/bert_uncased_L-2_H-512_A-8
[2768]: https://huggingface.co/google/bert_uncased_L-2_H-768_A-12
[4128]: https://huggingface.co/google/bert_uncased_L-4_H-128_A-2
[4256]: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
[4512]: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8
[4768]: https://huggingface.co/google/bert_uncased_L-4_H-768_A-12
[6128]: https://huggingface.co/google/bert_uncased_L-6_H-128_A-2
[6256]: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4
[6512]: https://huggingface.co/google/bert_uncased_L-6_H-512_A-8
[6768]: https://huggingface.co/google/bert_uncased_L-6_H-768_A-12
[8128]: https://huggingface.co/google/bert_uncased_L-8_H-128_A-2
[8256]: https://huggingface.co/google/bert_uncased_L-8_H-256_A-4
[8512]: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8
[8768]: https://huggingface.co/google/bert_uncased_L-8_H-768_A-12
[10128]: https://huggingface.co/google/bert_uncased_L-10_H-128_A-2
[10256]: https://huggingface.co/google/bert_uncased_L-10_H-256_A-4
[10512]: https://huggingface.co/google/bert_uncased_L-10_H-512_A-8
[10768]: https://huggingface.co/google/bert_uncased_L-10_H-768_A-12
[12128]: https://huggingface.co/google/bert_uncased_L-12_H-128_A-2
[12256]: https://huggingface.co/google/bert_uncased_L-12_H-256_A-4
[12512]: https://huggingface.co/google/bert_uncased_L-12_H-512_A-8
[12768]: https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
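The hyperparameter selection described above amounts to a search over 20 combinations per task. A minimal sketch of that grid, assuming an exhaustive search; `select_best` and its `evaluate` callback are hypothetical stand-ins for a real fine-tuning and evaluation run.

```python
from itertools import product

BATCH_SIZES = [8, 16, 32, 64, 128]
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 3e-5]
NUM_EPOCHS = 4  # fixed for every configuration, per the card


def hyperparameter_grid():
    """Enumerate every (batch size, learning rate) combination."""
    return [
        {"batch_size": batch_size, "learning_rate": lr, "epochs": NUM_EPOCHS}
        for batch_size, lr in product(BATCH_SIZES, LEARNING_RATES)
    ]


def select_best(configs, evaluate):
    """Pick the configuration with the highest score from `evaluate`."""
    return max(configs, key=evaluate)


grid = hyperparameter_grid()
```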

license:apache-2.0
16,596
10

vit-base-patch32-224-in21k

license:apache-2.0
16,580
19

gemma-2-2b-jpn-it

16,052
208

medsiglip-448

15,871
83

metricx-24-hybrid-xl-v2p6

license:apache-2.0
15,859
8

siglip-base-patch16-384

license:apache-2.0
15,462
10

bert_uncased_L-12_H-512_A-8

This is the set of 24 BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (English only, uncased, trained with WordPiece masking). We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
You can download the 24 BERT miniatures either from the official BERT GitHub page, or via Hugging Face from the links below:
| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| L=2 |[2/128 (BERT-Tiny)][2128]|[2/256][2256]|[2/512][2512]|[2/768][2768]|
| L=4 |[4/128][4128]|[4/256 (BERT-Mini)][4256]|[4/512 (BERT-Small)][4512]|[4/768][4768]|
| L=6 |[6/128][6128]|[6/256][6256]|[6/512][6512]|[6/768][6768]|
| L=8 |[8/128][8128]|[8/256][8256]|[8/512 (BERT-Medium)][8512]|[8/768][8768]|
| L=10 |[10/128][10128]|[10/256][10256]|[10/512][10512]|[10/768][10768]|
| L=12 |[12/128][12128]|[12/256][12256]|[12/512][12512]|[12/768 (BERT-Base)][12768]|
Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.
Here are the corresponding GLUE scores on the test set:
|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|
For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
If you use these models, please cite the paper referenced above.
[2128]: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
[2256]: https://huggingface.co/google/bert_uncased_L-2_H-256_A-4
[2512]: https://huggingface.co/google/bert_uncased_L-2_H-512_A-8
[2768]: https://huggingface.co/google/bert_uncased_L-2_H-768_A-12
[4128]: https://huggingface.co/google/bert_uncased_L-4_H-128_A-2
[4256]: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
[4512]: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8
[4768]: https://huggingface.co/google/bert_uncased_L-4_H-768_A-12
[6128]: https://huggingface.co/google/bert_uncased_L-6_H-128_A-2
[6256]: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4
[6512]: https://huggingface.co/google/bert_uncased_L-6_H-512_A-8
[6768]: https://huggingface.co/google/bert_uncased_L-6_H-768_A-12
[8128]: https://huggingface.co/google/bert_uncased_L-8_H-128_A-2
[8256]: https://huggingface.co/google/bert_uncased_L-8_H-256_A-4
[8512]: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8
[8768]: https://huggingface.co/google/bert_uncased_L-8_H-768_A-12
[10128]: https://huggingface.co/google/bert_uncased_L-10_H-128_A-2
[10256]: https://huggingface.co/google/bert_uncased_L-10_H-256_A-4
[10512]: https://huggingface.co/google/bert_uncased_L-10_H-512_A-8
[10768]: https://huggingface.co/google/bert_uncased_L-10_H-768_A-12
[12128]: https://huggingface.co/google/bert_uncased_L-12_H-128_A-2
[12256]: https://huggingface.co/google/bert_uncased_L-12_H-256_A-4
[12512]: https://huggingface.co/google/bert_uncased_L-12_H-512_A-8
[12768]: https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
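The 24 checkpoint names follow a regular pattern over the layer counts and hidden sizes in the table above. The sketch below generates them, assuming the attention-head count is the hidden size divided by 64; that ratio is inferred from the names in this listing rather than taken from a documented rule, and the helper name is hypothetical.

```python
LAYERS = [2, 4, 6, 8, 10, 12]
HIDDEN_SIZES = [128, 256, 512, 768]


def checkpoint_name(num_layers, hidden_size):
    """Build a miniature-BERT checkpoint name from its depth and width."""
    num_heads = hidden_size // 64  # assumption inferred from the listed names
    return f"bert_uncased_L-{num_layers}_H-{hidden_size}_A-{num_heads}"


names = [checkpoint_name(layers, hidden)
         for layers in LAYERS
         for hidden in HIDDEN_SIZES]
```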

license:apache-2.0
15,284
0

medgemma-27b-it

14,928
221

electra-small-generator

license:apache-2.0
12,982
13

bigbird-pegasus-large-arxiv

license:apache-2.0
12,629
64

vit-large-patch32-384

license:apache-2.0
12,219
17

bert_uncased_L-4_H-512_A-8

This is the set of 24 BERT models referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models (English only, uncased, trained with WordPiece masking). We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
You can download the 24 BERT miniatures either from the official BERT GitHub page, or via Hugging Face from the links below:
| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| L=2 |[2/128 (BERT-Tiny)][2128]|[2/256][2256]|[2/512][2512]|[2/768][2768]|
| L=4 |[4/128][4128]|[4/256 (BERT-Mini)][4256]|[4/512 (BERT-Small)][4512]|[4/768][4768]|
| L=6 |[6/128][6128]|[6/256][6256]|[6/512][6512]|[6/768][6768]|
| L=8 |[8/128][8128]|[8/256][8256]|[8/512 (BERT-Medium)][8512]|[8/768][8768]|
| L=10 |[10/128][10128]|[10/256][10256]|[10/512][10512]|[10/768][10768]|
| L=12 |[12/128][12128]|[12/256][12256]|[12/512][12512]|[12/768 (BERT-Base)][12768]|
Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.
Here are the corresponding GLUE scores on the test set:
|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|
For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
If you use these models, please cite the paper referenced above.
[2128]: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
[2256]: https://huggingface.co/google/bert_uncased_L-2_H-256_A-4
[2512]: https://huggingface.co/google/bert_uncased_L-2_H-512_A-8
[2768]: https://huggingface.co/google/bert_uncased_L-2_H-768_A-12
[4128]: https://huggingface.co/google/bert_uncased_L-4_H-128_A-2
[4256]: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
[4512]: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8
[4768]: https://huggingface.co/google/bert_uncased_L-4_H-768_A-12
[6128]: https://huggingface.co/google/bert_uncased_L-6_H-128_A-2
[6256]: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4
[6512]: https://huggingface.co/google/bert_uncased_L-6_H-512_A-8
[6768]: https://huggingface.co/google/bert_uncased_L-6_H-768_A-12
[8128]: https://huggingface.co/google/bert_uncased_L-8_H-128_A-2
[8256]: https://huggingface.co/google/bert_uncased_L-8_H-256_A-4
[8512]: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8
[8768]: https://huggingface.co/google/bert_uncased_L-8_H-768_A-12
[10128]: https://huggingface.co/google/bert_uncased_L-10_H-128_A-2
[10256]: https://huggingface.co/google/bert_uncased_L-10_H-256_A-4
[10512]: https://huggingface.co/google/bert_uncased_L-10_H-512_A-8
[10768]: https://huggingface.co/google/bert_uncased_L-10_H-768_A-12
[12128]: https://huggingface.co/google/bert_uncased_L-12_H-128_A-2
[12256]: https://huggingface.co/google/bert_uncased_L-12_H-256_A-4
[12512]: https://huggingface.co/google/bert_uncased_L-12_H-512_A-8
[12768]: https://huggingface.co/google/bert_uncased_L-12_H-768_A-12
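The distillation setting mentioned above, where a teacher produces the fine-tuning labels, can be sketched as a loss between teacher and student predictions. The card does not say whether hard or soft labels were used; this sketch assumes soft labels (the teacher's full probability distribution) and uses hypothetical function names and toy logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy of the student's prediction against the teacher's labels."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)


teacher_probs = softmax([2.0, 0.5])                       # toy teacher output
matching = distillation_loss([2.0, 0.5], teacher_probs)   # student agrees
mismatched = distillation_loss([0.5, 2.0], teacher_probs) # student disagrees
```

The loss is lower when the student's distribution matches the teacher's, which is the signal the smaller model is trained on.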

license:apache-2.0
12,218
5

vit-huge-patch14-224-in21k

license:apache-2.0
11,993
22

gemma-3n-E4B-it-litert-lm

10,590
203

muril-base-cased

license:apache-2.0
10,567
52

timesfm-2.0-500m-pytorch

license:apache-2.0
10,512
228

mobilenet_v1_0.75_192

10,391
2

bert_uncased_L-2_H-512_A-8

license:apache-2.0
10,360
0

vit-base-patch32-384

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who had already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, at a higher resolution of 384x384. Images are presented to the model as a sequence of fixed-size patches (resolution 32x32), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. The model can be used, for example, to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes.
Currently, both the feature extractor and model support PyTorch. TensorFlow and JAX/FLAX are coming soon, and the API of ViTFeatureExtractor might change.
The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Pre-training resolution is 224.
For evaluation results on several image classification benchmarks, see Tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Increasing the model size also tends to improve performance.
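Gradient clipping at global norm 1, mentioned in the training details above, rescales all gradients together whenever their combined L2 norm exceeds the threshold. The following is a minimal dependency-free sketch, assuming gradients are given as a list of flat lists of floats; the function name is hypothetical and this is not the authors' training code.

```python
import math


def clip_by_global_norm(gradients, max_norm=1.0):
    """Scale all gradients so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in gradients for g in tensor))
    if global_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(tensor) for tensor in gradients], global_norm
    scale = max_norm / global_norm
    clipped = [[g * scale for g in tensor] for tensor in gradients]
    return clipped, global_norm


# Two toy "parameter tensors" whose joint norm is 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is reduced.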

license:apache-2.0
9,984
23

t5-v1_1-small

license:apache-2.0
9,706
26

siglip-base-patch16-512

license:apache-2.0
9,581
28

efficientnet-b7

EfficientNet model trained on ImageNet-1k at resolution 600x600. It was introduced in the paper EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks by Mingxing Tan and Quoc V. Le, and first released in this repository. Disclaimer: The team releasing EfficientNet did not write a model card for this model, so this model card has been written by the Hugging Face team.
EfficientNet is a mobile-friendly, pure convolutional model (ConvNet) that proposes a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. The model can be used, for example, to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes. For more code examples, see the documentation.
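The compound coefficient mentioned above can be illustrated as follows. This sketch assumes the base multipliers reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution); those constants and the function names are not given in this card and are used here for illustration only.

```python
# Assumed base multipliers from the EfficientNet paper (not stated in this card).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate growth in FLOPs: depth x width^2 x resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


base = compound_scale(0)    # no scaling: all multipliers equal 1
scaled = compound_scale(1)  # one step of compound scaling
```

Under these constants, each unit increase of the coefficient roughly doubles the FLOPs, since 1.2 x 1.1^2 x 1.15^2 is about 1.92.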

license:apache-2.0
9,472
25

paligemma2-3b-mix-224

9,227
38

gemma-3-27b-pt

8,782
110

umt5-xl

UMT5 is pretrained on an updated version of the mC4 corpus, covering 107 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
Note: UMT5 was only pre-trained on mC4, excluding any supervised training. Therefore, this model has to be fine-tuned before it is usable on a downstream task.
Paper: UniMax, Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining
Authors: Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant
Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
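For comparison, the temperature-based sampling that the abstract treats as the baseline can be sketched as follows. This assumes the common formulation in which each language's probability is proportional to its corpus size raised to the power 1/temperature; the function name and toy counts are hypothetical.

```python
def temperature_sampling(char_counts, temperature):
    """Sampling probabilities proportional to count ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {lang: count ** (1.0 / temperature)
               for lang, count in char_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


toy_counts = {"low_resource": 1, "high_resource": 9}
proportional = temperature_sampling(toy_counts, 1.0)  # follows corpus sizes
flattened = temperature_sampling(toy_counts, 2.0)     # boosts the small language
```

A temperature of 1 samples in proportion to corpus size, while higher temperatures move the distribution toward uniform, which is what can cause many repeats over small corpora.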

license:apache-2.0
8,405
18

pegasus-cnn_dailymail

8,023
106

shieldgemma-2-4b-it

7,896
133

ddpm-ema-celebahq-256

license:apache-2.0
7,501
12

ddpm-celebahq-256

license:apache-2.0
7,430
55

t5gemma-2b-2b-ul2

7,419
12

gemma-3n-E4B

6,934
102

electra-large-discriminator

license:apache-2.0
6,467
15

t5gemma-2b-2b-prefixlm-it

6,463
4

gemma-3-4b-it-qat-q4_0-unquantized

6,387
9

muril-large-cased

6,304
17

paligemma2-3b-mix-448

6,268
52

t5-v1_1-large

license:apache-2.0
6,099
18

txgemma-27b-predict

5,650
35

pix2struct-base

license:apache-2.0
5,434
76

bert_uncased_L-12_H-768_A-12

license:apache-2.0
5,396
14

canine-c

license:apache-2.0
5,381
34

paligemma-3b-mix-448

5,276
115

tapas-large-finetuned-wtq

license:apache-2.0
5,114
147

bert_uncased_L-8_H-512_A-8

license:apache-2.0
4,884
4

efficientnet-b1

license:apache-2.0
4,838
1

codegemma-2b

4,682
87

shieldgemma-2b

4,581
92

siglip2-giant-opt-patch16-256

license:apache-2.0
4,511
2

canine-s

license:apache-2.0
4,495
27

txgemma-2b-predict

4,337
43

owlv2-large-patch14

license:apache-2.0
4,246
7

t5-efficient-small

license:apache-2.0
4,242
4

vaultgemma-1b

3,904
401

codegemma-7b

CodeGemma is provided for the transformers library under the Gemma license (https://ai.google.dev/gemma/terms). To access CodeGemma on Hugging Face, you are required to review and agree to Google's usage license.

NaNK
3,772
206

shieldgemma-9b

NaNK
3,731
24

tapas-base

license:apache-2.0
3,604
9

pix2struct-textcaps-base

license:apache-2.0
3,568
29

paligemma2-3b-pt-448

NaNK
3,498
46

deplot

license:apache-2.0
3,360
309

vit-large-patch32-224-in21k

license:apache-2.0
3,320
1

gemma-3-4b-it-qat-q4_0-gguf

NaNK
3,283
205

bert_uncased_L-4_H-128_A-2

license:apache-2.0
3,245
0

madlad400-10b-mt

NaNK
license:apache-2.0
3,227
125

efficientnet-b2

license:apache-2.0
3,084
3

owlv2-base-patch16-finetuned

license:apache-2.0
3,053
3

medgemma-4b-pt

NaNK
2,988
127

timesfm-1.0-200m-pytorch

license:apache-2.0
2,957
29

t5_xxl_true_nli_mixture

This is an NLI model based on T5-XXL that predicts a binary label ('1' - Entailment, '0' - No entailment). It is trained similarly to the NLI model described in the TRUE paper (Honovich et al., 2022), but using the following datasets instead of ANLI: SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), Fever (Thorne et al., 2018), Scitail (Khot et al., 2018), PAWS (Zhang et al., 2019), and VitaminC (Schuster et al., 2021). The input format for the model is: "premise: PREMISETEXT hypothesis: HYPOTHESISTEXT". If you use this model for a research publication, please cite the TRUE paper and the dataset papers mentioned above.
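The input and label conventions described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the model card: the helper names `build_nli_input` and `parse_nli_label` are hypothetical, and the sketch assumes the model emits the label as the bare text '1' or '0'.

```python
def build_nli_input(premise: str, hypothesis: str) -> str:
    """Format a premise/hypothesis pair as 'premise: ... hypothesis: ...'."""
    return f"premise: {premise} hypothesis: {hypothesis}"


def parse_nli_label(generated: str) -> bool:
    """Map generated label text to a boolean; True means entailment ('1')."""
    label = generated.strip()
    if label not in {"0", "1"}:
        # Anything other than the two documented labels is treated as an error.
        raise ValueError(f"unexpected label: {generated!r}")
    return label == "1"


example = build_nli_input("A dog runs in the park.", "An animal is moving.")
```

The formatted string would then be tokenized and passed to the model; decoding the generated output and handing it to `parse_nli_label` yields the binary decision.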

license:apache-2.0
2,930
50

pix2struct-docvqa-base

license:apache-2.0
2,852
39

embeddinggemma-300m-qat-q8_0-unquantized

2,756
28

bit-50

The BiT model was proposed in Big Transfer (BiT): General Visual Representation Learning by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning. Disclaimer: The team releasing BiT did not write a model card for this model so this model card has been written by the Hugging Face team. From the paper's abstract: Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance. You can use the raw model for image classification, for example to classify an image into one of the 1,000 ImageNet classes. See the model hub to look for fine-tuned versions on a task that interests you. For code examples, we refer to the documentation.

license:apache-2.0
2,675
5

bert_uncased_L-6_H-768_A-12

license:apache-2.0
2,597
4

bigbird-roberta-large

license:apache-2.0
2,595
29

codegemma-7b-it

NaNK
2,542
238

rembert

license:apache-2.0
2,434
21

bert_uncased_L-12_H-256_A-4

license:apache-2.0
2,399
1

roberta2roberta_L-24_discofuse

license:apache-2.0
2,372
2

gemma-3-1b-it-qat-q4_0-gguf

NaNK
2,371
96

metricx-24-hybrid-large-v2p6

license:apache-2.0
2,280
4

gemma-3n-E2B

NaNK
1,960
75

tapas-base-finetuned-tabfact

license:apache-2.0
1,936
1

reformer-crime-and-punishment

1,870
11

paligemma2-10b-mix-448

NaNK
1,814
34

efficientnet-b4

license:apache-2.0
1,769
2

bert_for_seq_generation_L-24_bbc_encoder

1,728
1

t5gemma-2b-2b-ul2-it

NaNK
1,616
8

long-t5-tglobal-large

license:apache-2.0
1,582
15

metricx-23-xl-v2p0

license:apache-2.0
1,577
1

gemma-3-1b-it-qat-q4_0-unquantized

NaNK
1,561
9

siglip-so400m-patch16-256-i18n

license:apache-2.0
1,498
30

tapas-base-finetuned-sqa

license:apache-2.0
1,490
7

long-t5-local-base

license:apache-2.0
1,331
15

t5gemma-b-b-prefixlm-it

1,272
18

pix2struct-ai2d-base

license:apache-2.0
1,234
43

ddpm-ema-church-256

license:apache-2.0
1,200
13

mobilenet_v1_1.0_224

1,184
1

long-t5-tglobal-xl

license:apache-2.0
1,155
22

gemma-3-12b-it-qat-int4-unquantized

NaNK
1,104
11

deeplabv3_mobilenet_v2_1.0_513

1,095
9

videoprism-lvt-large-f8r288

license:apache-2.0
1,078
12

ddpm-cat-256

license:apache-2.0
1,067
14

t5gemma-9b-2b-ul2-it

NaNK
1,010
4

metricx-24-hybrid-large-v2p6-bfloat16

license:apache-2.0
997
1

efficientnet-b3

license:apache-2.0
995
5

gemma-3-27b-it-qat-q4_0-unquantized

NaNK
992
36

ddpm-church-256

license:apache-2.0
983
11

switch-base-16

license:apache-2.0
983
4

tapas-tiny-finetuned-sqa

license:apache-2.0
967
1

pix2struct-large

license:apache-2.0
934
34

efficientnet-b5

license:apache-2.0
922
2

t5gemma-s-s-ul2

NaNK
866
2

pegasus-x-base

859
11

bigbird-pegasus-large-pubmed

license:apache-2.0
856
47

long-t5-local-large

license:apache-2.0
849
5

paligemma-3b-pt-448

NaNK
793
30

paligemma2-10b-mix-224

NaNK
784
9

reformer-enwik8

752
20

mobilenet_v2_1.4_224

746
2

gemma-3-12b-it-qat-q4_0-unquantized

NaNK
721
16

embeddinggemma-300m-qat-q4_0-unquantized

698
39

bigbird-base-trivia-itc

license:apache-2.0
672
8

metricx-24-hybrid-xl-v2p6-bfloat16

license:apache-2.0
671
0

gemma-3-270m-it-qat-q4_0-unquantized

NaNK
660
11

bert_uncased_L-2_H-256_A-4

license:apache-2.0
649
2

DiarizationLM-8b-Fisher-v2

NaNK
llama
640
30

paligemma2-3b-pt-896

NaNK
639
22

t5gemma-l-l-ul2-it

NaNK
637
2

t5-xl-lm-adapt

license:apache-2.0
624
14

gemma-3-270m-qat-q4_0-unquantized

NaNK
614
8

bert2bert_L-24_wmt_de_en

NaNK
license:apache-2.0
610
8

t5gemma-b-b-prefixlm

NaNK
588
11

gemma-3-4b-it-qat-int4-unquantized

NaNK
578
8

gemma-7b-aps-it

NaNK
577
42

bert_uncased_L-8_H-256_A-4

license:apache-2.0
562
0

vivit-b-16x2

license:mit
543
11

txgemma-9b-chat

NaNK
542
41

roberta2roberta_L-24_bbc

license:apache-2.0
534
3

paligemma2-10b-ft-docci-448

NaNK
533
17

hear-pytorch

517
10

tapas-tiny-finetuned-wtq

license:apache-2.0
512
1

txgemma-27b-chat

NaNK
508
56

madlad400-7b-mt

NaNK
license:apache-2.0
506
20

txgemma-9b-predict

NaNK
497
25

paligemma2-28b-mix-224

NaNK
474
4

tapas-large

license:apache-2.0
473
3

paligemma-3b-ft-gqa-224

NaNK
473
0

pegasus-multi_news

471
26

electra-base-generator

license:apache-2.0
454
8

bert_uncased_L-12_H-128_A-2

license:apache-2.0
436
0

t5-small-ssm-nq

license:apache-2.0
421
1

paligemma-3b-ft-nlvr2-448

NaNK
418
1

mobilenet_v2_0.35_96

410
1

bigbird-pegasus-large-bigpatent

license:apache-2.0
406
40

tapas-base-finetuned-wikisql-supervised

license:apache-2.0
401
9

datagemma-rig-27b-it

NaNK
394
107

tapas-small-finetuned-sqa

license:apache-2.0
392
1

t5gemma-2b-2b-prefixlm

NaNK
391
4

metricx-23-qe-xl-v2p0

license:apache-2.0
383
2

t5-large-lm-adapt

license:apache-2.0
377
8

t5gemma-ml-ml-ul2-it

NaNK
374
2

pegasus-pubmed

348
9

owlv2-large-patch14-finetuned

license:apache-2.0
345
4

gemma-3-1b-it-qat-int4-unquantized

NaNK
336
11

t5-small-lm-adapt

license:apache-2.0
322
9

pegasus-x-base-arxiv

316
1

pegasus-arxiv

314
2

pix2struct-chartqa-base

license:apache-2.0
312
9

umt5-small

license:apache-2.0
310
24

mobilenet_v2_0.75_160

305
2

timesfm-1.0-200m

license:apache-2.0
301
773

t5-base-lm-adapt

license:apache-2.0
297
18

t5gemma-9b-9b-prefixlm-it

NaNK
297
5

matcha-chartqa

license:apache-2.0
296
47

t5gemma-s-s-ul2-it

NaNK
282
2

efficientnet-b6

license:apache-2.0
282
0

t5gemma-9b-9b-ul2-it

NaNK
277
3

t5-xxl-lm-adapt

license:apache-2.0
273
10

t5gemma-xl-xl-prefixlm-it

NaNK
270
6

derm-foundation

269
68

pix2struct-docvqa-large

license:apache-2.0
269
32

bert_uncased_L-6_H-256_A-4

license:apache-2.0
269
1

metricx-23-large-v2p0

license:apache-2.0
267
5

videoprism-large-f8r288

license:apache-2.0
260
17

metricx-24-hybrid-xxl-v2p6

license:apache-2.0
260
9

paligemma-3b-ft-cococap-448

NaNK
257
3

magenta-realtime

Magenta RealTime is offered under a combination of licenses: the codebase is licensed under Apache 2.0, and the model weights under Creative Commons Attribution 4.0 International. In addition, we specify the following usage terms: Use these materials responsibly and do not generate content, including outputs, that infringe or violate the rights of others, including rights in copyrighted content. Google claims no rights in outputs you generate using Magenta RealTime. You and your users are solely responsible for outputs and their subsequent uses. Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses. You are solely responsible for determining the appropriateness of using, reproducing, modifying, performing, displaying or distributing the software and materials, and any outputs, and assume any and all risks associated with your use or distribution of any of the software and materials, and any outputs, and your exercise of rights and permissions under the licenses. Magenta RealTime is an open music generation model from Google built from the same research and technology used to create MusicFX DJ and Lyria RealTime. Magenta RealTime enables the continuous generation of musical audio steered by a text prompt, an audio example, or a weighted combination of multiple text prompts and/or audio examples. Its relatively small size makes it possible to deploy in environments with limited resources, including live performance settings or freely available Colab TPUs. Magenta RealTime is composed of three components: SpectroStream, MusicCoCa, and an LLM. A full technical report with more details on each component is available.

1. SpectroStream is a discrete audio codec that converts stereo 48kHz audio into tokens, building on the SoundStream RVQ codec from Zeghidour+ 21.
2. MusicCoCa is a contrastive-trained model capable of embedding audio and text into a common embedding space, building on Yu+ 22 and Huang+ 22.
3. An encoder-decoder Transformer LLM generates audio tokens given context audio tokens and a tokenized MusicCoCa embedding, building on the MusicLM method from Agostinelli+ 23.

- SpectroStream RVQ codec: Tokenizes high-fidelity music audio
  - Encoder input / Decoder output: Music audio waveforms, 48kHz stereo
  - Encoder output / Decoder input: Discrete audio tokens, 25Hz frame rate, 64 RVQ depth, 10 bit codes, 16kbps
- MusicCoCa: Joint embeddings of text and music audio
  - Input: Music audio waveforms, 16kHz mono, or text representation of music style e.g. "heavy metal"
  - Output: 768 dimensional embedding, quantized to 12 RVQ depth, 10 bit codes
- Encoder-decoder Transformer LLM: Generates audio tokens given context and style
  - Encoder Input: (Context, 1000 tokens) 10s of audio context tokens w/ 4 RVQ depth, (Style, 6 tokens) Quantized MusicCoCa style embedding
  - Decoder Output: (Generated, 800 tokens) 2s of audio w/ 16 RVQ depth

Music generation models, in particular ones targeted for continuous real-time generation and control, have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. - Interactive Music Creation - Live Performance / Improvisation: These models can be used to generate music in a live performance setting, controlled by performers manipulating style embeddings or the audio context - Accessible Music-Making & Music Therapy: People with impediments to using traditional instruments (skill gaps, disabilities, etc.)
can participate in communal jam sessions or solo music creation. - Video Games: Developers can create a custom soundtrack for users in real-time based on their actions and environment. - Research - Transfer learning: Researchers can leverage representations from MusicCoCa and Magenta RT to recognize musical information. - Personalization - Musicians can finetune models with their own catalog to customize the model to their style (fine tuning support coming soon). - Education - Exploring Genres, Instruments, and History: Natural language prompting enables users to quickly learn about and experiment with musical concepts. See our Terms of Use above for usage we consider out of scope. Magenta RT supports the real-time generation and steering of instrumental music. The purpose and intention of this capability is to foster the development of new real-time, interactive co-creation workflows that seamlessly integrate with human-centered forms of musical creativity. Every AI music generation model, including Magenta RT, carries a risk of impacting the economic and cultural landscape of music. We aim to mitigate these risks through the following avenues: - Prioritizing human-AI interaction as fundamental in the design of Magenta RT. - Distributing the model under a terms of service that prohibit developers from generating outputs that infringe or violate the rights of others, including rights in copyrighted content. - Training on primarily instrumental data. With specific prompting, this model has been observed to generate some vocal sounds and effects, though those vocal sounds and effects tend to be non-lexical. Coverage of broad musical styles. Magenta RT's training data primarily consists of Western instrumental music. As a consequence, Magenta RT has incomplete coverage of both vocal performance and the broader landscape of rich musical traditions worldwide. For real-time generation with broader style coverage, we refer users to our Lyria RealTime API. Vocals. 
While the model is capable of generating non-lexical vocalizations and humming, it is not conditioned on lyrics and is unlikely to generate actual words. However, there remains some risk of generating explicit or culturally-insensitive lyrical content. Latency. Because the Magenta RT LLM operates on two second chunks, user inputs for the style prompt may take two or more seconds to influence the musical output. Limited context. Because the Magenta RT encoder has a maximum audio context window of ten seconds, the model is unable to directly reference music that has been output earlier than that. While the context is sufficient to enable the model to create melodies, rhythms, and chord progressions, the model is not capable of automatically creating longer-term song structures. At the time of release, Magenta RealTime represents the only open weights model supporting real-time, continuous musical audio generation. It is designed specifically to enable live, interactive musical creation, bringing new capabilities to musical performances, art installations, video games, and many other applications. See our Colab demo and GitHub repository for usage examples. Magenta RealTime was trained on ~190k hours of stock music from multiple sources, mostly instrumental. Magenta RealTime was trained using Tensor Processing Unit (TPU) hardware (TPUv6e / Trillium). Training was done using JAX and T5X, utilizing SeqIO for data pipelines. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. Model evaluation metrics and results will be shared in our forthcoming technical report.
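The token counts and bitrate quoted in the component specifications follow from the 25Hz frame rate, the RVQ depth, and the 10-bit codes. The sketch below only reproduces that arithmetic; the function names are illustrative and are not part of the Magenta RealTime codebase.

```python
FRAME_RATE_HZ = 25   # SpectroStream frames per second (from the spec above)
BITS_PER_CODE = 10   # each RVQ code is a 10 bit index (from the spec above)


def codec_bitrate_bps(rvq_depth: int) -> int:
    """Bits per second for a token stream with the given RVQ depth."""
    return FRAME_RATE_HZ * rvq_depth * BITS_PER_CODE


def audio_tokens(seconds: int, rvq_depth: int) -> int:
    """Number of discrete tokens covering `seconds` of audio."""
    return seconds * FRAME_RATE_HZ * rvq_depth


full_codec_bps = codec_bitrate_bps(64)   # full 64-level codec
context_tokens = audio_tokens(10, 4)     # encoder context: 10s at depth 4
generated_tokens = audio_tokens(2, 16)   # decoder output: 2s at depth 16
```

Under these figures the full codec works out to 16,000 bits per second (16kbps), the 10 second context to 1000 tokens, and each 2 second generated chunk to 800 tokens, matching the numbers in the specification.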

license:cc-by-4.0
255
519

t5gemma-9b-2b-prefixlm

NaNK
249
1

ncsnpp-celebahq-256

license:apache-2.0
241
1

t5gemma-s-s-prefixlm-it

NaNK
228
3

datagemma-rag-27b-it

NaNK
224
188

gemma-3-12b-pt-qat-q4_0-gguf

NaNK
220
17

matcha-plotqa-v1

license:apache-2.0
220
3

t5-efficient-mini

T5-Efficient-MINI is a variation of Google's original T5 following the T5 model architecture. It is a pretrained-only checkpoint and was released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler. In a nutshell, the paper indicates that a Deep-Narrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count. To quote the paper:

> We generally recommend a DeepNarrow strategy where the model's depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.

To be more precise, model depth is defined as the number of transformer blocks that are stacked sequentially. A sequence of word embeddings is therefore processed sequentially by each transformer block. This model checkpoint - t5-efficient-mini - is of model type Mini with no variations. It has 31.23 million parameters and thus requires ca. 124.92 MB of memory in full precision (fp32) or 62.46 MB of memory in half precision (fp16 or bf16). A summary of the original T5 model architectures can be seen here:

| Model | nl (el/dl) | ff | dm | kv | nh | #Params |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Tiny | 4/4 | 1024 | 256 | 32 | 4 | 16M |
| Mini | 4/4 | 1536 | 384 | 32 | 8 | 31M |
| Small | 6/6 | 2048 | 512 | 32 | 8 | 60M |
| Base | 12/12 | 3072 | 768 | 64 | 12 | 220M |
| Large | 24/24 | 4096 | 1024 | 64 | 16 | 738M |
| Xl | 24/24 | 16384 | 1024 | 128 | 32 | 3B |
| XXl | 24/24 | 65536 | 1024 | 128 | 128 | 11B |

| Abbreviation | Definition |
| ---- | ---- |
| nl | Number of transformer blocks (depth) |
| dm | Dimension of embedding vector (output vector of transformers block) |
| kv | Dimension of key/value projection matrix |
| nh | Number of attention heads |
| ff | Dimension of intermediate vector within transformer block (size of feed-forward projection matrix) |
| el | Number of transformer blocks in the encoder (encoder depth) |
| dl | Number of transformer blocks in the decoder (decoder depth) |
| sh | Signifies that attention heads are shared |
| skv | Signifies that key-values projection matrices are tied |

If a model checkpoint has no specific el or dl, then both the number of encoder and decoder layers correspond to nl. The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective. Note: This model is a pretrained checkpoint and has to be fine-tuned for practical usage. The checkpoint was pretrained in English and is therefore only useful for English NLP tasks. You can follow one of the following examples on how to fine-tune the model:

- Summarization
- Question Answering
- Text Classification - Note: You will have to slightly adapt the training example here to make it work with an encoder-decoder model.

We strongly recommend the reader to go carefully through the original paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers to get a more nuanced understanding of this model checkpoint. As explained in the following issue, checkpoints including the sh or skv model architecture variations have not been ported to Transformers as they are probably of limited practical usage and are lacking a more detailed description. Those checkpoints are kept here as they might be ported in the future.
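The memory figures quoted for this checkpoint follow directly from the parameter count: 4 bytes per parameter in fp32 and 2 bytes in fp16 or bf16. A minimal sketch of that arithmetic, assuming 1 MB = 10^6 bytes and counting weights only (no activations or optimizer state); the function name is illustrative:

```python
NUM_PARAMS = 31.23e6  # parameter count stated for t5-efficient-mini


def weight_memory_mb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


fp32_mb = round(weight_memory_mb(NUM_PARAMS, 4), 2)  # full precision
fp16_mb = round(weight_memory_mb(NUM_PARAMS, 2), 2)  # half precision (fp16/bf16)
```

The two rounded values correspond to the 124.92 MB and 62.46 MB quoted in the description.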

license:apache-2.0
217
8

t5-efficient-base

license:apache-2.0
211
10

switch-base-32

license:apache-2.0
210
10

pix2struct-widget-captioning-base

license:apache-2.0
210
6

byt5-xl

license:apache-2.0
208
12

paligemma2-10b-pt-448

NaNK
204
14

tapas-small-finetuned-wtq

license:apache-2.0
195
6

xtr-base-en

license:apache-2.0
192
6

metricx-23-qe-large-v2p0

license:apache-2.0
191
7

pegasus-x-large

189
20

switch-base-64

license:apache-2.0
184
3

t5-efficient-tiny-nl2

license:apache-2.0
181
0

ul2

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. For more information, please take a look at the original paper. Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler.

The checkpoint was iteratively pre-trained on C4 and fine-tuned on a variety of datasets. The model is pretrained on the C4 corpus. For pretraining, the model is trained on a total of 1 trillion tokens on C4 (2 million steps) with a batch size of 1024. The sequence length is set to 512/512 for inputs and targets. Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens. The model has 32 encoder layers and 32 decoder layers, `dmodel` of 4096 and `df` of 16384. The dimension of each head is 256 for a total of 16 heads. Our model uses a model parallelism of 8. The same SentencePiece tokenizer as T5, with a vocab size of 32000, is used. UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. UL-20B was trained using the Jax and T5X infrastructure. The training objective during pretraining is a mixture of different denoising strategies that are explained in the following. To quote the paper:

> We conjecture that a strong universal model has to be exposed to solving diverse set of problems during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity should be injected to the objective of the model, otherwise the model might suffer from lack a certain ability, like long-coherent text generation. Motivated by this, as well as current class of objective functions, we define three main paradigms that are used during pre-training:

- R-Denoiser: The regular denoising is the standard span corruption introduced in T5 that uses a range of 2 to 5 tokens as the span length, which masks about 15% of input tokens. These spans are short and potentially useful to acquire knowledge instead of learning to generate fluent text.
- S-Denoiser: A specific case of denoising where we observe a strict sequential order when framing the inputs-to-targets task, i.e., prefix language modeling. To do so, we simply partition the input sequence into two sub-sequences of tokens as context and target such that the targets do not rely on future information. This is unlike standard span corruption where there could be a target token with earlier position than a context token. Note that similar to the Prefix-LM setup, the context (prefix) retains a bidirectional receptive field. We note that S-Denoising with very short memory or no memory is in similar spirit to standard causal language modeling.
- X-Denoiser: An extreme version of denoising where the model must recover a large part of the input, given a small to moderate part of it. This simulates a situation where a model needs to generate long target from a memory with relatively limited information. To do so, we opt to include examples with aggressive denoising where approximately 50% of the input sequence is masked. This is by increasing the span length and/or corruption rate. We consider a pre-training task to be extreme if it has a long span (e.g., ≥ 12 tokens) or have a large corruption rate (e.g., ≥ 30%). X-denoising is motivated by being an interpolation between regular span corruption and language model like objectives.

Important: For more details, please see section 3.1.2 of the paper.

The model was continuously fine-tuned after N pretraining steps, where N is typically from 50k to 100k. In other words, after every N steps of pretraining, the model is fine-tuned on each downstream task. See section 5.2.2 of the paper for an overview of all datasets that were used for fine-tuning. As the model is continuously fine-tuned, fine-tuning is stopped on a task once it has reached state-of-the-art to save compute. In total, the model was trained for 2.65 million steps. Important: For more details, please see sections 5.2.1 and 5.2.2 of the paper. Masked passages can be predicted using any of the different denoising strategies. Given the size of the model, inference needs to be run on at least a 40GB A100 GPU. For S-Denoising, prompt the text with the prefix `[S2S]`. For R-Denoising, prompt the text with the prefix `[NLU]`. For X-Denoising, prompt the text with the prefix `[NLG]`.
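The mode prefixes named in the description can be applied with a small helper. This is a hedged sketch rather than code from the UL2 release: the names `MODE_PREFIX` and `add_mode_prefix` are hypothetical, and separating the prefix from the text with a single space is an assumption. The last lines re-derive two figures from the description as a sanity check, counting only the 512 input tokens per example.

```python
# Denoiser name -> prompt prefix, as listed in the description.
MODE_PREFIX = {"R": "[NLU]", "S": "[S2S]", "X": "[NLG]"}


def add_mode_prefix(text: str, denoiser: str) -> str:
    """Prepend the mode-switching prefix for the R, S, or X denoiser."""
    key = denoiser.upper()
    if key not in MODE_PREFIX:
        raise ValueError(f"unknown denoiser {denoiser!r}; expected R, S, or X")
    # Assumption: one space between the prefix and the prompt text.
    return f"{MODE_PREFIX[key]} {text}"


prompt = add_mode_prefix("Mr. Dursley was the director of a firm called", "S")

# Sanity checks on figures quoted in the description.
model_dim = 16 * 256                         # heads x head dimension
input_tokens_seen = 2_000_000 * 1024 * 512   # steps x batch size x input length
```

Here `model_dim` evaluates to 4096, consistent with the stated `dmodel`, and `input_tokens_seen` is about 1.05 trillion, consistent with the "about 1 trillion tokens" figure.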

license:apache-2.0
180
180

pix2struct-ocrvqa-base

license:apache-2.0
180
5

t5gemma-l-l-ul2

NaNK
177
4

paligemma2-28b-mix-448

NaNK
171
27

paligemma-3b-ft-ocrvqa-448

NaNK
170
6

pix2struct-screen2words-base

license:apache-2.0
165
24

gemma-3-27b-pt-qat-q4_0-gguf

NaNK
163
28

ddpm-ema-cat-256

license:apache-2.0
162
4

DiarizationLM-8b-Fisher-v1

NaNK
llama
162
3

switch-base-128

license:apache-2.0
159
5

ddpm-ema-bedroom-256

license:apache-2.0
158
3

gemma-2-2b-it-GGUF

NaNK
llama.cpp
157
84

gemma-7b-GGUF

NaNK
llama.cpp
155
21

timesfm-2.0-500m-jax

license:apache-2.0
153
16

t5gemma-b-b-ul2-it

NaNK
153
3

matcha-chart2text-pew

license:apache-2.0
149
40

madlad400-7b-mt-bt

NaNK
license:apache-2.0
149
7

gemma-2b-it-GGUF

NaNK
llama.cpp
143
19

codegemma-2b-GGUF

NaNK
llama.cpp
140
28

pegasus-billsum

133
4

paligemma-3b-ft-rsvqa-hr-448

NaNK
133
0

DiarizationLM-13b-Fisher-v1

NaNK
llama
130
11

matcha-chart2text-statista

license:apache-2.0
127
10

codegemma-1.1-7b-it

NaNK
125
51

tapas-large-finetuned-sqa

license:apache-2.0
125
7

bert2bert_L-24_wmt_en_de

NaNK
license:apache-2.0
125
5

bert_uncased_L-2_H-768_A-12

license:apache-2.0
123
4

t5-efficient-tiny-nh8

license:apache-2.0
121
0

switch-c-2048

license:apache-2.0
119
293

path-foundation

118
54

gemma-2b-GGUF

NaNK
llama.cpp
117
15

ddpm-bedroom-256

license:apache-2.0
116
6

t5gemma-9b-9b-prefixlm

NaNK
112
2

gemma-2b-AWQ

NaNK
112
0

shieldgemma-27b

NaNK
111
27

bert_uncased_L-6_H-512_A-8

license:apache-2.0
103
0

madlad400-8b-lm

NaNK
license:apache-2.0
102
10

gemma-2b-aps-it

NaNK
100
20

matcha-base

license:apache-2.0
99
27

gemma-7b-it-GGUF

NaNK
llama.cpp
98
44

metricx-23-qe-xxl-v2p0

license:apache-2.0
97
7

paligemma-3b-pt-896

NaNK
96
121

gemma-3-1b-pt-qat-q4_0-gguf

NaNK
93
12

bert_uncased_L-10_H-128_A-2

license:apache-2.0
93
0

gemma-2b-pytorch

NaNK
92
8

paligemma-3b-ft-docvqa-896

NaNK
91
9

paligemma2-10b-pt-896

NaNK
89
32

gemma-3-4b-pt-qat-q4_0-gguf

NaNK
89
23

paligemma-3b-ft-vqav2-448

NaNK
88
18

t5gemma-ml-ml-ul2

NaNK
88
0

pix2struct-screen2words-large

license:apache-2.0
87
21

tapas-large-finetuned-wikisql-supervised

license:apache-2.0
87
6

ncsnpp-ffhq-256

license:apache-2.0
86
4

t5gemma-ml-ml-prefixlm-it

NaNK
84
1

tapas-small

license:apache-2.0
84
0

gemma-2-2b-GGUF

NaNK
llama.cpp
82
17

byt5-xxl

license:apache-2.0
81
19

t5gemma-xl-xl-prefixlm

NaNK
81
1

paligemma-3b-ft-nlvr2-224

NaNK
80
1

electra-large-generator

license:apache-2.0
79
8

t5gemma-l-l-prefixlm-it

NaNK
79
1

bert_uncased_L-8_H-128_A-2

license:apache-2.0
79
0

paligemma-3b-ft-science-qa-224-jax

NaNK
77
0

gemma-2b-it-keras

NaNK
76
2

cxr-foundation

75
90

gemma-1.1-2b-it-GGUF

NaNK
llama.cpp
75
19

t5_11b_trueteacher_and_anli

NaNK
license:cc-by-nc-4.0
75
16

pix2struct-ocrvqa-large

license:apache-2.0
74
34

paligemma-3b-ft-vqav2-224

NaNK
74
2

gemma-7b-it-pytorch

NaNK
72
6

hear

71
29

gemma-7b-pytorch

NaNK
71
3

codegemma-1.1-2b-keras

NaNK
71
1

gemma-2-instruct-9b-keras

NaNK
68
7

t5-large-ssm-nq

license:apache-2.0
68
5

gemma-1.1-2b-it-keras

NaNK
68
3

gemma-1.1-2b-it-pytorch

NaNK
66
7

codegemma-7b-GGUF

NaNK
llama.cpp
65
22

gemma-1.1-7b-it-pytorch

NaNK
65
4

gemma-7b-AWQ

NaNK
64
0

paligemma-3b-ft-coco35l-448

NaNK
61
0

gemma-2-9b-keras

NaNK
60
7

paligemma-3b-ft-ocrvqa-224

NaNK
60
4

gemma-2b-keras

NaNK
59
5

t5gemma-xl-xl-ul2-it

NaNK
59
4

seahorse-large-q5

license:cc-by-4.0
58
0

gemma-1.1-7b-it-keras

NaNK
57
3

t5gemma-9b-2b-prefixlm-it

NaNK
57
1

seahorse-large-q4

license:cc-by-4.0
57
0

paligemma-3b-ft-ocrvqa-896

NaNK
56
17

gemma-7b-keras

NaNK
56
3

multiberts-seed_0-step_1900k

license:apache-2.0
56
0

paligemma-3b-ft-infovqa-896

NaNK
56
0

ncsnpp-ffhq-1024

license:apache-2.0
55
12

tapas-mini-finetuned-wtq

license:apache-2.0
55
3

codegemma-2b-keras

NaNK
55
2

codegemma-7b-it-keras

NaNK
55
2

tapas-large-finetuned-tabfact

license:apache-2.0
54
4

gemma-7b-it-keras

NaNK
54
2

switch-large-128

license:apache-2.0
53
6

codegemma-7b-keras

NaNK
53
1

bert_uncased_L-4_H-768_A-12

license:apache-2.0
53
0

codegemma-7b-it-GGUF

NaNK
llama.cpp
52
60

pegasus-reddit_tifu

50
3

t5-efficient-xl

license:apache-2.0
50
1

multiberts-seed_0

license:apache-2.0
50
0

multiberts-seed_1

license:apache-2.0
49
0

roberta2roberta_L-24_cnn_daily_mail

license:apache-2.0
48
6

t5-efficient-tiny-dl2

license:apache-2.0
47
0

tapas-tiny

license:apache-2.0
47
0

realm-cc-news-pretrained-embedder

license:apache-2.0
46
1

pix2struct-ai2d-large

license:apache-2.0
45
4

t5gemma-l-l-prefixlm

NaNK
45
3

t5-small-ssm

license:apache-2.0
45
2

matcha-plotqa-v2

license:apache-2.0
44
13

tapas-medium-finetuned-wtq

license:apache-2.0
44
2

switch-xxl-128

license:apache-2.0
43
12

metricx-23-xxl-v2p0

license:apache-2.0
43
9

bert_uncased_L-10_H-768_A-12

license:apache-2.0
43
0

tapas-medium-finetuned-wikisql-supervised

license:apache-2.0
43
0

pegasus-wikihow

42
7

pegasus-newsroom

41
16

tapas-small-finetuned-wikisql-supervised

license:apache-2.0
41
7

switch-base-256

license:apache-2.0
41
4

paligemma-3b-ft-textcaps-448

NaNK
41
2

roberta2roberta_L-24_wikisplit

license:apache-2.0
40
8

realm-orqa-nq-openqa

license:apache-2.0
40
4

t5-efficient-large-nl32

license:apache-2.0
40
1

gemma-2b-it-pytorch

NaNK
39
11

paligemma2-28b-pt-448

NaNK
39
10

paligemma-3b-ft-vizwizvqa-448

NaNK
39
2

bert_uncased_L-8_H-768_A-12

license:apache-2.0
39
1

paligemma2-3b-pt-448-keras

NaNK
39
0

t5gemma-9b-2b-ul2

NaNK
39
0

tapas-base-masklm

38
1

Pix2struct Widget Captioning Large

Model card for Pix2Struct, fine-tuned on Widget Captioning (captioning a UI component on a screen), large version.
Pix2Struct is an image-encoder, text-decoder model trained on image-text pairs for various tasks, including image captioning and visual question answering. The full list of available models can be found in Table 1 of the paper.
The abstract of the paper states:
> Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
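The abstract mentions a variable-resolution input representation but gives no details. As a rough sketch of how such a scheme can work (an assumption here, not something this card states), an image can be rescaled so that as many fixed-size patches as possible fit within a patch budget while the aspect ratio is roughly preserved. The function name `patch_grid` and the default values for `patch_size` and `max_patches` are hypothetical and chosen only for illustration.

```python
import math


def patch_grid(image_height, image_width, patch_size=16, max_patches=2048):
    """Return (rows, cols, resized_height, resized_width) for a patch grid
    that stays within max_patches and roughly keeps the aspect ratio."""
    # Scale factor chosen so that rows * cols approaches max_patches.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    # Round down to whole patches; always keep at least one row and one column.
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    # The image would be resized to exactly cover the chosen grid.
    return rows, cols, rows * patch_size, cols * patch_size


# A square image gets a square grid; a wide image gets more columns than rows.
square = patch_grid(1024, 1024)
wide = patch_grid(300, 1200, patch_size=16, max_patches=512)
```

With this approach a wide screenshot keeps more columns than rows instead of being squashed into a fixed square, which is the point of a variable-resolution input.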
Checkpoints can be converted with the `convert_pix2struct_checkpoint_to_pytorch.py` script. Once saved, the converted model can be pushed to the Hugging Face Hub.
The instructions for running the model are exactly the same as those on the `pix2struct-textcaps-base` model card.
This model was originally contributed by Kenton Lee, Mandar Joshi et al. and added to the Hugging Face ecosystem by Younes Belkada.
If you want to cite this work, please consider citing the original paper.
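The running instructions this card defers to are not reproduced here. A minimal sketch of image-only captioning with the Pix2Struct classes in `transformers` might look like the following. The Hub id in `MODEL_ID`, the helper name `caption_widget`, and the `max_new_tokens` default are assumptions made for illustration, not values taken from this card.

```python
# Assumed Hub id for this model; adjust if the repository name differs.
MODEL_ID = "google/pix2struct-widget-captioning-large"


def caption_widget(image_path, model_id=MODEL_ID, max_new_tokens=50):
    """Generate a caption for a screenshot stored at image_path."""
    # Imported lazily so that defining this helper does not require
    # the heavy dependencies to be installed.
    from PIL import Image
    from transformers import (
        Pix2StructForConditionalGeneration,
        Pix2StructProcessor,
    )

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

    # The processor turns the image into flattened patches plus an attention mask.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Decode the generated token ids back into text.
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated[0], skip_special_tokens=True)
```

Calling `caption_widget("screenshot.png")` (a hypothetical file name) downloads the weights on first use and returns the caption as a string.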

license:apache-2.0
37
20

tapas-mini-finetuned-sqa

license:apache-2.0
37
4

codegemma-1.1-2b-GGUF

NaNK
llama.cpp
37
2

t5-large-ssm-nqo

license:apache-2.0
37
1

paligemma-3b-ft-screen2words-224

NaNK
37
1

bert_uncased_L-10_H-256_A-4

license:apache-2.0
37
0

t5-efficient-small-dl12

license:apache-2.0
37
0

t5gemma-ml-ml-prefixlm

NaNK
37
0

pix2struct-infographics-vqa-base

license:apache-2.0
36
8

gemma-7b-quant-pytorch

NaNK
36
2

bert_uncased_L-6_H-128_A-2

license:apache-2.0
36
1

t5-efficient-base-kv16

license:apache-2.0
36
0

t5-efficient-large-el8

license:apache-2.0
36
0

t5gemma-xl-xl-ul2

NaNK
36
0

t5-large-ssm

license:apache-2.0
35
3

pegasus-gigaword

35
2

t5-xl-ssm-nq

license:apache-2.0
35
1

tapas-medium-finetuned-sqa

license:apache-2.0
35
1

tapas-medium-masklm

35
1

bert_uncased_L-10_H-512_A-8

license:apache-2.0
35
0

t5-11b-ssm

NaNK
license:apache-2.0
35
0

t5-efficient-tiny-dl8

license:apache-2.0
35
0

tapas-medium

license:apache-2.0
35
0

tapas-small-finetuned-tabfact

license:apache-2.0
35
0

seahorse-large-q1

license:cc-by-4.0
35
0

paligemma-3b-ft-refcoco-seg-224

NaNK
35
0

t5-efficient-xxl

license:apache-2.0
34
27

xtr-base-multilingual

license:apache-2.0
34
9

seahorse-xxl-q1

license:cc-by-4.0
34
7

t5-efficient-large

license:apache-2.0
34
5

t5-efficient-small-el16

license:apache-2.0
34
2

t5-efficient-base-nl8

license:apache-2.0
34
1

t5-efficient-large-nl2

license:apache-2.0
34
1

t5-efficient-small-dm768

license:apache-2.0
34
1

t5-xxl-ssm-nq

license:apache-2.0
34
1

t5-11b-ssm-nq

NaNK
license:apache-2.0
34
0

t5-efficient-large-dl32

license:apache-2.0
34
0

t5-efficient-small-kv32

license:apache-2.0
34
0

t5-efficient-tiny-nl6

license:apache-2.0
34
0

tapas-medium-finetuned-tabfact

license:apache-2.0
34
0

tapas-mini-finetuned-tabfact

license:apache-2.0
34
0

tapas-tiny-finetuned-tabfact

license:apache-2.0
34
0

seahorse-large-q3

license:cc-by-4.0
34
0

paligemma2-3b-pt-224-keras

NaNK
34
0

paligemma2-3b-pt-896-keras

NaNK
34
0

paligemma2-10b-pt-224-keras

NaNK
34
0

pix2struct-textcaps-large

license:apache-2.0
33
14

gemma-7b-it-quant-pytorch

NaNK
33
11

t5-xxl-ssm

license:apache-2.0
33
4

t5-efficient-small-nl36

license:apache-2.0
33
3

t5-efficient-base-el8

license:apache-2.0
33
2

seahorse-large-q6

license:cc-by-4.0
33
2

paligemma-3b-mix-448-keras

NaNK
33
2

realm-orqa-nq-reader

license:apache-2.0
33
1

t5-efficient-base-dm2000

license:apache-2.0
33
1

t5-efficient-base-nl24

license:apache-2.0
33
1

t5-efficient-base-nl4

license:apache-2.0
33
1

t5-efficient-mini-nl24

license:apache-2.0
33
1