# line-corporation/clip-japanese-base
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. The model was trained on ~1B web-collected image-text pairs and is applicable to various visual tasks, including zero-shot image classification and text-to-image / image-to-text retrieval (see the usage sketch at the end of this card). The model uses an EVA02-B Transformer as the image encoder and a 12-layer BERT as the text encoder; the text encoder was initialized from rinna/japanese-clip-vit-b-16.

## Evaluation Dataset

- STAIR Captions (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We report R@1 averaged over the i2t and t2i directions (see the sketch after the result table).
- Recruit Datasets for image classification.
- ImageNet-1K for image classification. We translated all class names into Japanese; the class names and templates can be found in ja-imagenet-1k-classnames.txt and ja-imagenet-1k-templates.txt.

## Result

| Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|------------------|---------------------|--------------------|----------------------|--------------------------|---------------------|
| Ours             | 86M (EVA02-B)       | 100M (BERT)        | 0.30                 | 0.89                     | 0.58                |
| Stable-ja-clip   | 307M (ViT-L)        | 100M (BERT)        | 0.24                 | 0.77                     | 0.68                |
| Rinna-ja-clip    | 86M (ViT-B)         | 100M (BERT)        | 0.13                 | 0.54                     | 0.56                |
| Laion-clip       | 632M (ViT-H)        | 561M (XLM-RoBERTa) | 0.30                 | 0.83                     | 0.58                |
| Hakuhodo-ja-clip | 632M (ViT-H)        | 100M (BERT)        | 0.21                 | 0.82                     | 0.46                |
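To make the retrieval metric concrete: R@1 in each direction is the fraction of queries whose top-1 retrieved item is the correct match, and the table reports the average of the i2t and t2i values. Below is a minimal sketch of that computation, assuming row-aligned, L2-normalized feature matrices where row i of each matrix forms a matched pair. Note this one-to-one pairing is a simplification; STAIR Captions provides multiple captions per image, so the actual benchmark bookkeeping is slightly more involved.

```python
import torch

def recall_at_1(image_feats: torch.Tensor, text_feats: torch.Tensor) -> float:
    """Average of i2t and t2i R@1 for row-aligned, L2-normalized features."""
    sim = image_feats @ text_feats.T                     # (N, N) cosine similarities
    targets = torch.arange(sim.size(0))
    i2t = (sim.argmax(dim=1) == targets).float().mean()  # image -> text: best text per image row
    t2i = (sim.argmax(dim=0) == targets).float().mean()  # text -> image: best image per text column
    return ((i2t + t2i) / 2).item()
```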
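For reference, here is a minimal zero-shot classification sketch along the lines described above. It assumes the repository's remote code exposes the standard CLIP-style `get_image_features` / `get_text_features` interface via `trust_remote_code=True`; the exact entry points and the image path are assumptions for illustration, not confirmed by this card.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

model_id = "line-corporation/clip-japanese-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code is assumed necessary because the encoders (EVA02-B + BERT) are custom.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device)

image = Image.open("example.jpg")  # hypothetical local image
labels = ["犬", "猫", "象"]         # "dog", "cat", "elephant"
text = tokenizer(labels, padding=True, return_tensors="pt").to(device)
inputs = processor(image, return_tensors="pt").to(device)

with torch.no_grad():
    image_features = model.get_image_features(**inputs)  # assumed API
    text_features = model.get_text_features(**text)      # assumed API
    probs = (image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability over the three Japanese labels
```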