dandelin

8 models • 1 total models in database

Sort by:

vilt-b32-mlm

Vision-and-Language Transformer (ViLT), pre-trained only Vision-and-Language Transformer (ViLT) model pre-trained on GCC+SBU+COCO+VG (200k steps). It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository. Note: this model only includes the language modeling head. Disclaimer: The team releasing ViLT did not write a model card for this model so this model card has been written by the Hugging Face team. You can use the raw model for masked language modeling given an image and a piece of text with [MASK] tokens.

—

3,736,607

vilt-b32-finetuned-coco

--- license: apache-2.0 ---

license:apache-2.0

270,295

dandelin

vilt-b32-mlm

vilt-b32-finetuned-coco

vilt-b32-finetuned-vqa

vilt-b32-finetuned-nlvr2

vilt-b32-mlm-itm

vilt-b32-finetuned-flickr30k

hype-sampler-abl

hype-sampler