OFA-Sys
chinese-clip-vit-base-patch16
Introduction

This is the base version of Chinese CLIP, with ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repo https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome! 🔥🔥)

Use with the official API

We provide a simple code snippet to show how to use the Chinese-CLIP API to compute image and text embeddings and their similarities. If you need more than the API, see our GitHub repo https://github.com/OFA-Sys/Chinese-CLIP for details about training and inference.

Results

| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |

| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |

| Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!
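
Appendix: similarity and R@K sketch

The section "Use with the official API" mentions computing similarities between image and text embeddings, and the retrieval tables report R@1, R@5, and R@10. The following is a minimal, standard-library Python sketch of both steps, not the official Chinese-CLIP code. It assumes the usual CLIP-style cosine similarity over L2-normalized embeddings and the common definition of Recall@K with exactly one correct image per text query. The function names (`l2_normalize`, `similarity_matrix`, `recall_at_k`) and the toy embeddings are hypothetical and chosen only for illustration; real embeddings would come from the model.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_matrix(
    text_embs: Sequence[Sequence[float]],
    image_embs: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return sims[i][j], the cosine similarity between text i and image j."""
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in images] for t in texts]


def recall_at_k(
    sims: Sequence[Sequence[float]],
    ground_truth: Sequence[int],
    k: int,
) -> float:
    """Percentage of queries whose correct item is among the k most similar items.

    Assumes one correct item per query (an illustrative simplification).
    """
    hits = 0
    for query_idx, row in enumerate(sims):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sims)


# Toy, made-up 3-dimensional embeddings (hypothetical values, not model output).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.8, 0.6]]
ground_truth = [0, 1, 2]  # text i is paired with image i

sims = similarity_matrix(text_embs, image_embs)
r1 = recall_at_k(sims, ground_truth, k=1)
r2 = recall_at_k(sims, ground_truth, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

On this toy data the second text ranks the third image above its own, so R@1 is about 66.7 while R@2 is 100.0. Published benchmarks may pair several captions with each image, which this sketch does not handle.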