apple
mobilevit-small
---
license: other
tags:
- vision
- image-classification
datasets:
- imagenet-1k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---
DFN5B-CLIP-ViT-H-14-378
---
license: apple-amlr
license_name: apple-sample-code-license
license_link: LICENSE
---

A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 5B images that were filtered from a pool of 43B uncurated image-text pairs (12.8B image-text pairs from CommonPool-12.8B + 30B additional public image-text pairs).
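The filtering idea can be sketched with a toy example: a small scoring network rates how well each caption matches its image, and only the top-scoring fraction of the uncurated pool is kept for training. The `score_fn` and the sample pool below are stand-ins, not Apple's actual DFN.

```python
def filter_pool(pairs, score_fn, keep_fraction=0.15):
    """Keep the top-scoring fraction of an uncurated image-text pool.

    `score_fn` stands in for a Data Filtering Network that rates how well
    a caption matches its image (higher = better aligned).
    """
    scored = sorted(pairs, key=score_fn, reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Toy pool: (image_id, caption, pretend alignment score).
pool = [
    ("img0", "a photo of a dog", 0.92),
    ("img1", "buy now!!! best price", 0.11),
    ("img2", "a red bicycle on a street", 0.78),
    ("img3", "asdf jkl", 0.05),
]
kept = filter_pool(pool, score_fn=lambda p: p[2], keep_fraction=0.5)
# Only the two well-aligned pairs survive the filter.
```

In the real pipeline the score comes from a trained CLIP-style filtering model rather than a stored number, but the keep-the-top-fraction selection is the same shape.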
DFN5B-CLIP-ViT-H-14
MobileCLIP-S2-OpenCLIP
mobilevit-x-small
OpenELM-1_1B-Instruct
Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari

We introduce OpenELM, a family of Open Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. We pretrained OpenELM models using the CoreNet library. We release both pretrained and instruction-tuned models with 270M, 450M, 1.1B and 3B parameters. We release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research. Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens. Please check the license agreements and terms of these datasets before using them. We provide an example function to generate output from OpenELM models loaded via the Hugging Face Hub in `generate_openelm.py`. You can try the model by running the following command: Please refer to this link to obtain your Hugging Face access token. Additional arguments to the Hugging Face `generate` function can be passed via `generate_kwargs`.
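The `generate_openelm.py` helper from the repo is not reproduced here; the snippet below is a minimal sketch of the same flow using plain `transformers` APIs. It assumes the Llama-2 tokenizer (which OpenELM uses, and which requires a Hugging Face access token for the gated repo); extra keyword arguments are forwarded to `generate` just like `generate_kwargs` above.

```python
def generate(prompt: str,
             model_id: str = "apple/OpenELM-1_1B-Instruct",
             **generate_kwargs) -> str:
    """Generate text from an OpenELM checkpoint loaded via the Hugging Face Hub.

    Extra keyword arguments (e.g. prompt_lookup_num_tokens=10 for lookup-token
    speculative generation, or assistant_model=<smaller OpenELM model> for
    model-wise speculative generation) are forwarded to `model.generate`.
    """
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # OpenELM reuses the Llama-2 tokenizer; access requires an HF access token.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, **generate_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Once upon a time there was", prompt_lookup_num_tokens=10))
```

The `max_new_tokens` value and the prompt are illustrative; see the repo's `generate_openelm.py` for the reference implementation.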
As an example, to speed up inference, you can try lookup-token speculative generation by passing the `prompt_lookup_num_tokens` argument as follows: Alternatively, try model-wise speculative generation with an assistant model by passing a smaller model through the `assistant_model` argument, for example:

| Model Size | ARC-c | ARC-e | BoolQ | HellaSwag | PIQA | SciQ | WinoGrande | Average |
|------------|-------|-------|-------|-----------|------|------|------------|---------|
| OpenELM-270M | 26.45 | 45.08 | 53.98 | 46.71 | 69.75 | 84.70 | 53.91 | 54.37 |
| OpenELM-270M-Instruct | 30.55 | 46.68 | 48.56 | 52.07 | 70.78 | 84.40 | 52.72 | 55.11 |
| OpenELM-450M | 27.56 | 48.06 | 55.78 | 53.97 | 72.31 | 87.20 | 58.01 | 57.56 |
| OpenELM-450M-Instruct | 30.38 | 50.00 | 60.37 | 59.34 | 72.63 | 88.00 | 58.96 | 59.95 |
| OpenELM-1_1B | 32.34 | 55.43 | 63.58 | 64.81 | 75.57 | 90.60 | 61.72 | 63.44 |
| OpenELM-1_1B-Instruct | 37.97 | 52.23 | 70.00 | 71.20 | 75.03 | 89.30 | 62.75 | 65.50 |
| OpenELM-3B | 35.58 | 59.89 | 67.40 | 72.44 | 78.24 | 92.70 | 65.51 | 67.39 |
| OpenELM-3B-Instruct | 39.42 | 61.74 | 68.17 | 76.36 | 79.00 | 92.50 | 66.85 | 69.15 |

| Model Size | ARC-c | HellaSwag | MMLU | TruthfulQA | WinoGrande | Average |
|------------|-------|-----------|------|------------|------------|---------|
| OpenELM-270M | 27.65 | 47.15 | 25.72 | 39.24 | 53.83 | 38.72 |
| OpenELM-270M-Instruct | 32.51 | 51.58 | 26.70 | 38.72 | 53.20 | 40.54 |
| OpenELM-450M | 30.20 | 53.86 | 26.01 | 40.18 | 57.22 | 41.50 |
| OpenELM-450M-Instruct | 33.53 | 59.31 | 25.41 | 40.48 | 58.33 | 43.41 |
| OpenELM-1_1B | 36.69 | 65.71 | 27.05 | 36.98 | 63.22 | 45.93 |
| OpenELM-1_1B-Instruct | 41.55 | 71.83 | 25.65 | 45.95 | 64.72 | 49.94 |
| OpenELM-3B | 42.24 | 73.28 | 26.76 | 34.98 | 67.25 | 48.90 |
| OpenELM-3B-Instruct | 47.70 | 76.87 | 24.80 | 38.76 | 67.96 | 51.22 |

| Model Size | ARC-c | CrowS-Pairs | HellaSwag | MMLU | PIQA | RACE | TruthfulQA | WinoGrande | Average |
|------------|-------|-------------|-----------|------|------|------|------------|------------|---------|
| OpenELM-270M | 27.65 | 66.79 | 47.15 | 25.72 | 69.75 | 30.91 | 39.24 | 53.83 | 45.13 |
| OpenELM-270M-Instruct | 32.51 | 66.01 | 51.58 | 26.70 | 70.78 | 33.78 | 38.72 | 53.20 | 46.66 |
| OpenELM-450M | 30.20 | 68.63 | 53.86 | 26.01 | 72.31 | 33.11 | 40.18 | 57.22 | 47.69 |
| OpenELM-450M-Instruct | 33.53 | 67.44 | 59.31 | 25.41 | 72.63 | 36.84 | 40.48 | 58.33 | 49.25 |
| OpenELM-1_1B | 36.69 | 71.74 | 65.71 | 27.05 | 75.57 | 36.46 | 36.98 | 63.22 | 51.68 |
| OpenELM-1_1B-Instruct | 41.55 | 71.02 | 71.83 | 25.65 | 75.03 | 39.43 | 45.95 | 64.72 | 54.40 |
| OpenELM-3B | 42.24 | 73.29 | 73.28 | 26.76 | 78.24 | 38.76 | 34.98 | 67.25 | 54.35 |
| OpenELM-3B-Instruct | 47.70 | 72.33 | 76.87 | 24.80 | 79.00 | 38.47 | 38.76 | 67.96 | 55.73 |

See the technical report for more results and comparisons. The release of OpenELM models aims to empower and enrich the open research community by providing access to state-of-the-art language models. Trained on publicly available datasets, these models are made available without any safety guarantees. Consequently, they may produce outputs that are inaccurate, harmful, biased, or objectionable in response to user prompts. It is therefore imperative for users and developers to undertake thorough safety testing and implement appropriate filtering mechanisms tailored to their specific requirements.
DepthPro
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.

Depth Pro was introduced in Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. The checkpoint in this repository is a reference implementation, which has been re-trained. Its performance is close to the model reported in the paper but does not match it exactly. Please follow the steps in the code repository to set up your environment. Then you can download the checkpoint from the Files and versions tab above, or use the `huggingface-hub` CLI: The code repo provides a helper script to run the model on a single image: Boundary metrics are implemented in `eval/boundary_metrics.py` and can be used as follows: If you find our work useful, please cite the following paper: Our codebase is built using multiple open-source contributions; please see Acknowledgements for more details. Please check the paper for a complete list of references and datasets used in this work.
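Depth Pro also estimates the horizontal angular field-of-view from a single image. Converting a predicted FOV to a focal length in pixels is plain pinhole-camera geometry, sketched below; the 60° FOV and 1536-pixel width in the example are illustrative values, not model outputs.

```python
import math

def focal_px_from_fov(fov_deg: float, image_width_px: int) -> float:
    """Pinhole-camera conversion from horizontal field-of-view to focal length.

    f_px = (W / 2) / tan(fov / 2)
    """
    return 0.5 * image_width_px / math.tan(math.radians(fov_deg) / 2.0)

# A 60-degree horizontal FOV on a 1536-pixel-wide image:
f = focal_px_from_fov(60.0, 1536)  # ~1330 px
```

This is the conversion that makes metric (absolute-scale) depth possible without camera-intrinsics metadata: the model supplies the FOV itself.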
DepthPro-hf
This is the transformers version of DepthPro, a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. For the checkpoint compatible with the original codebase, please check this repo.

DepthPro: Monocular Depth Estimation

Table of Contents
- Model Details
- Model Sources
- How to Get Started with the Model
- Training Details
- Training Data
- Preprocessing
- Training Hyperparameters
- Evaluation
- Model Architecture and Objective
- Citation
- Model Card Authors

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

> We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.
> Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- Developed by: Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun
- Model type: DepthPro
- License: Apple-ASCL
- HF Docs: DepthPro
- Repository: https://github.com/apple/ml-depth-pro
- Paper: https://arxiv.org/abs/2410.02073

The DepthPro model was trained on the following datasets: Images go through the following preprocessing steps:
- rescaled by `1/255`
- normalized with `mean=[0.5, 0.5, 0.5]` and `std=[0.5, 0.5, 0.5]`
- resized to `1536x1536` pixels

The `DepthProForDepthEstimation` model uses a `DepthProEncoder` for encoding the input image and a `FeatureFusionStage` for fusing the output features from the encoder. The `DepthProEncoder` further uses two encoders:
- `patch_encoder`
  - The input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
  - Each scaled image is split into smaller patches of size `patch_size`, with overlapping areas determined by `scaled_images_overlap_ratios`.
  - These patches are processed by the `patch_encoder`.
- `image_encoder`
  - The input image is also rescaled to `patch_size` and processed by the `image_encoder`.

Both encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are separate `Dinov2Model`s by default. Outputs from both encoders (`last_hidden_state`) and selected intermediate states (`hidden_states`) from the `patch_encoder` are fused by a `DPT`-based `FeatureFusionStage` for depth estimation. The network is supplemented with a focal length estimation head: a small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field-of-view.
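The overlapping patch layout described above can be sketched for a single spatial axis. The function below is illustrative rather than the transformers implementation, and the sizes in the example are hypothetical; it only shows how a patch size plus an overlap ratio determine the patch offsets over a scaled image.

```python
def patch_grid(image_size: int, patch_size: int, overlap_ratio: float):
    """Top-left offsets of overlapping square patches along one axis.

    Mirrors the idea behind `scaled_images_overlap_ratios`: consecutive
    patches share `overlap_ratio * patch_size` pixels.
    """
    stride = int(patch_size * (1.0 - overlap_ratio))
    offsets = list(range(0, image_size - patch_size + 1, stride))
    # Make sure the final patch reaches the image border.
    if offsets[-1] != image_size - patch_size:
        offsets.append(image_size - patch_size)
    return offsets

# A 768-pixel axis covered by 384-pixel patches with 25% overlap:
offsets = patch_grid(768, 384, 0.25)  # [0, 288, 384]
```

Applying the same grid along both axes yields the set of patches that the shared Dinov2 `patch_encoder` processes at each scale.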
DFN2B-CLIP-ViT-B-16
DFN2B-CLIP-ViT-L-14
FastVLM-0.5B
FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025).

Highlights
- We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
- Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
- Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder, with a 7.9x faster TTFT.

Evaluations

| Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D | 68.0 | 77.4 | 83.6 |
| ScienceQA | 85.2 | 94.4 | 96.7 |
| MMMU | 33.9 | 37.8 | 45.4 |
| VQAv2 | 76.3 | 79.1 | 80.8 |
| ChartQA | 76.0 | 80.1 | 85.0 |
| TextVQA | 64.5 | 70.4 | 74.9 |
| InfoVQA | 46.4 | 59.7 | 75.8 |
| DocVQA | 82.5 | 88.3 | 93.2 |
| OCRBench | 63.9 | 70.2 | 73.1 |
| RealWorldQA | 56.1 | 61.2 | 67.2 |
| SeedBench-Img | 71.0 | 74.2 | 75.4 |

Usage Example

To run inference with the PyTorch checkpoint, follow the instructions in the official repo and run inference using `predict.py` from the official repo.

Run inference with Transformers (Remote Code)

To run inference with transformers, we can leverage `trust_remote_code` along with the following snippet:

Citation

If you find this model useful, please cite the following paper:
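The remote-code loading pattern mentioned above can be sketched as follows. Only the `from_pretrained(..., trust_remote_code=True)` calls are standard transformers API; how the repo's custom code expects images and prompts is defined by the model repository itself, so this sketch stops at loading.

```python
def load_fastvlm(model_id: str = "apple/FastVLM-0.5B"):
    """Load FastVLM via the custom modeling code shipped in the model repo.

    `trust_remote_code=True` lets transformers import the model class from
    the repository instead of a built-in architecture; review the repo's
    code before enabling it.
    """
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer

if __name__ == "__main__":
    model, tokenizer = load_fastvlm()
```

See the snippet in the official repo for the full image-plus-prompt inference path.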
DFN2B-CLIP-ViT-L-14-39B
MobileCLIP-S1-OpenCLIP
mobilevitv2-1.0-imagenet1k-256
mobilevit-xx-small
MobileViT model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The license used is the Apple sample code license. Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.

MobileViT is a lightweight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: Currently, both the feature extractor and model support PyTorch.

The MobileViT model was pretrained on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes. Training requires only basic data augmentation, i.e. random resized cropping and horizontal flipping. To learn multi-scale representations without requiring fine-tuning, a multi-scale sampler was used during training, with image sizes randomly sampled from: (160, 160), (192, 192), (256, 256), (288, 288), (320, 320). At inference time, images are resized/rescaled to the same resolution (288x288) and center-cropped at 256x256. Pixels are normalized to the range [0, 1]. Images are expected to be in BGR pixel order, not RGB.
The MobileViT networks are trained from scratch for 300 epochs on ImageNet-1k on 8 NVIDIA GPUs with an effective batch size of 1024 and learning rate warmup for 3k steps, followed by cosine annealing. Also used were label smoothing cross-entropy loss and L2 weight decay. Training resolution varies from 160x160 to 320x320, using multi-scale sampling.

| Model | ImageNet top-1 accuracy | ImageNet top-5 accuracy | # params | URL |
|---------------|-------------------------|-------------------------|----------|-------------------------------------------------|
| MobileViT-XXS | 69.0 | 88.9 | 1.3 M | https://huggingface.co/apple/mobilevit-xx-small |
| MobileViT-XS | 74.8 | 92.3 | 2.3 M | https://huggingface.co/apple/mobilevit-x-small |
| MobileViT-S | 78.4 | 94.1 | 5.6 M | https://huggingface.co/apple/mobilevit-small |
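The classification usage described above can be sketched with the `MobileViTImageProcessor` and `MobileViTForImageClassification` classes from transformers; the processor handles the resize, center crop, and RGB-to-BGR channel flip that the card describes. Loading the weights requires network access, so the heavy calls live inside the function.

```python
def classify(image, model_id: str = "apple/mobilevit-small") -> str:
    """Classify a PIL image into one of the 1,000 ImageNet classes."""
    # Imported lazily so the helper can be defined without transformers installed.
    import torch
    from transformers import MobileViTForImageClassification, MobileViTImageProcessor

    # The processor resizes, center-crops to 256x256, rescales to [0, 1],
    # and flips channel order to BGR as the model expects.
    processor = MobileViTImageProcessor.from_pretrained(model_id)
    model = MobileViTForImageClassification.from_pretrained(model_id)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(-1))]

if __name__ == "__main__":
    from PIL import Image
    import requests

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    print(classify(image))
```

The COCO image URL is the sample commonly used in transformers docs; any RGB image works.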
OpenELM-3B-Instruct
deeplabv3-mobilevit-small
MobileCLIP-B-LT-OpenCLIP
FastVLM-1.5B
FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025).

Highlights
- We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
- Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
- Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder, with a 7.9x faster TTFT.

Evaluations

| Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D | 68.0 | 77.4 | 83.6 |
| ScienceQA | 85.2 | 94.4 | 96.7 |
| MMMU | 33.9 | 37.8 | 45.4 |
| VQAv2 | 76.3 | 79.1 | 80.8 |
| ChartQA | 76.0 | 80.1 | 85.0 |
| TextVQA | 64.5 | 70.4 | 74.9 |
| InfoVQA | 46.4 | 59.7 | 75.8 |
| DocVQA | 82.5 | 88.3 | 93.2 |
| OCRBench | 63.9 | 70.2 | 73.1 |
| RealWorldQA | 56.1 | 61.2 | 67.2 |
| SeedBench-Img | 71.0 | 74.2 | 75.4 |

Usage Example

To run inference with the PyTorch checkpoint, follow the instructions in the official repo and run inference using `predict.py` from the official repo.

Run inference with Transformers (Remote Code)

To run inference with transformers, we can leverage `trust_remote_code` along with the following snippet:

Citation

If you find this model useful, please cite the following paper:
aimv2-large-patch14-224
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation

If you find our work useful, please consider citing us as:
FastVLM-7B
OpenELM-270M
DiffuCoder-7B-Instruct
DiffuCoder-7B-cpGRPO
The DiffuCoder-7B-cpGRPO variant further refines DiffuCoder-7B-Instruct with reinforcement learning via Coupled-GRPO.
- Initialized from DiffuCoder-7B-Instruct, post-trained with Coupled-GRPO on 21K code samples (1 epoch).
- Coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding.
- Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement

To power this Hugging Face model release, we reuse Dream's modeling architecture and generation utilities.
aimv2-huge-patch14-448
OpenELM-3B
coreml-mobileclip
OpenELM-270M-Instruct
deeplabv3-mobilevit-xx-small
aimv2-large-patch14-224-lit
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation

If you find our work useful, please consider citing us as:
OpenELM-450M
OpenELM-450M-Instruct
aimv2-large-patch14-native
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation

If you find our work useful, please consider citing us as:
OpenELM-1_1B
MobileCLIP-B-OpenCLIP
DiffuCoder-7B-Base
The DiffuCoder-7B-Base model is our foundational masked diffusion LLM for code generation.
- Training recipe: using DiffuLLaMA's adaptation approach, trained on a large corpus of code (65B tokens in Stage 1 and 65B tokens in Stage 2).
- Benchmarks: strong baseline performance on HumanEval, MBPP, and BigCodeBench.
- Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement

To power this Hugging Face model release, we reuse Dream's modeling architecture and generation utilities.
FastVLM 0.5B Fp16
mobileclip_b_lt_timm
Coreml Stable Diffusion V1 5
This model was generated by Hugging Face using Apple's repository, which is released under the ASCL. Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. For more information about how Stable Diffusion functions, please have a look at 🤗's Stable Diffusion blog. The Stable-Diffusion-v1-5 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned for 595k steps at resolution 512x512 on "la...
coreml-stable-diffusion-2-1-base
aimv2-1B-patch14-224
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation

If you find our work useful, please consider citing us as:
mobileclip_s0_timm
mobileclip_s2_timm
coreml-depth-anything-v2-small
coreml-YOLOv3
aimv2-large-patch14-336
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Some AIMv2 highlights include:
1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation

If you find our work useful, please consider citing us as:
MobileCLIP2-S2
MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S2 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:------|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:
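Such a snippet can be sketched as below, following the `open_clip`-style pattern used by the `ml-mobileclip` repo (`create_model_and_transforms` / `get_tokenizer`). The architecture name, checkpoint path, and captions here are placeholders; check the repo's model registry for the exact names it supports.

```python
def zero_shot(image_path: str, captions, checkpoint: str,
              arch: str = "mobileclip_s2"):
    """Score captions against an image with a MobileCLIP checkpoint (zero-shot)."""
    # Imported lazily; `mobileclip` comes from the ml-mobileclip repo.
    import torch
    from PIL import Image
    import mobileclip

    # `arch` is an assumed registry name; `checkpoint` is the downloaded .pt file.
    model, _, preprocess = mobileclip.create_model_and_transforms(
        arch, pretrained=checkpoint)
    tokenizer = mobileclip.get_tokenizer(arch)

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(captions)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
    return probs[0].tolist()

if __name__ == "__main__":
    scores = zero_shot("image.jpg",
                       ["a diagram", "a dog", "a cat"],
                       checkpoint="mobileclip_s2.pt")
    print(scores)
```

The cosine-similarity-then-softmax scoring is the standard CLIP zero-shot recipe; only the model-construction calls are specific to `ml-mobileclip`.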
coreml-detr-semantic-segmentation
aimv2-large-patch14-448
TiC-CLIP-basic-sequential
coreml-stable-diffusion-v1-4
MobileCLIP2-S0
MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S0 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:------|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:
coreml-stable-diffusion-v1-5-palettized
MobileCLIP-S1
MobileCLIP-S0
coreml-stable-diffusion-2-base
FastVLM-7B-int4
TiC-CLIP-bestpool-cumulative
TiC-CLIP-bestpool-sequential
MobileCLIP-S2
coreml-sam2.1-tiny
deeplabv3-mobilevit-x-small
mobileclip_s1_timm
coreml-depth-anything-small
coreml-resnet-50
coreml-stable-diffusion-2-1-base-palettized
coreml-stable-diffusion-xl-base
MobileCLIP2-S3
MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S3 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:------|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:
MobileCLIP2-B
FastVLM-1.5B-int8
coreml-sam2.1-small
coreml-sam2.1-baseplus
coreml-sam2.1-large
TiC-CLIP-basic-cumulative
mobilevitv2-2.0-imagenet1k-256
mistral-coreml
DFN-public
aimv2-large-patch14-224-distilled
MobileCLIP2-S4
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains the MobileCLIP2-S4 checkpoint, which matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on an iPhone 12 Pro Max; see the first MobileCLIP2 entry above for the full model description, benchmark table, and usage instructions.
coreml-stable-diffusion-xl-base-with-refiner
mobileclip_b_timm
coreml-stable-diffusion-2-base-palettized
aimv2-huge-patch14-224
We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement and scales effectively. Highlights include:

1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

If you find our work useful, please consider citing us.
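"Frozen trunk" in highlight 3 means the pretrained backbone is never updated during evaluation; only a lightweight classifier on top of its features is fit. As a toy illustration of that split (this is not the AIMv2 recipe, which fits a probe on real features; the stub backbone and nearest-class-mean classifier here are purely illustrative):

```python
def frozen_trunk(image):
    """Stub for a pretrained backbone: a deterministic 8-d feature vector.
    Under the frozen-trunk protocol this mapping is never updated."""
    padded = (image + "\0" * 8)[:8]          # pad/truncate the id to 8 chars
    return [float(ord(c)) for c in padded]

def fit_class_means(labeled):
    """The only 'training': average frozen features per class."""
    means = {}
    for label, images in labeled.items():
        feats = [frozen_trunk(im) for im in images]
        means[label] = [sum(col) / len(col) for col in zip(*feats)]
    return means

def predict(image, means):
    """Classify by nearest class mean in frozen feature space."""
    f = frozen_trunk(image)
    dist = lambda m: sum((a - b) ** 2 for a, b in zip(f, m))
    return min(means, key=lambda lbl: dist(means[lbl]))

train = {"cat": ["cat1", "cat2"], "dog": ["dog1", "dog2"]}
means = fit_class_means(train)
pred = predict("cat1", means)  # -> "cat"
```

Only `fit_class_means` ever looks at labels; `frozen_trunk` stays fixed throughout, which is the point of the protocol.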
MobileCLIP-B-LT
coreml-stable-diffusion-mixed-bit-palettization
MobileCLIP2-L-14
coreml-sam2-large
MobileCLIP-L-14
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains the MobileCLIP-L-14 checkpoint; see the first MobileCLIP2 entry above for the full model description, benchmark table, and usage instructions.
aimv2-large-patch14-336-distilled
MobileCLIP-S4
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains the MobileCLIP-S4 checkpoint; see the first MobileCLIP2 entry above for the full model description, benchmark table, and usage instructions.
aimv2-3B-patch14-224
aimv2-1B-patch14-448
aimv2-3B-patch14-448
MobileCLIP-B
TiC-CLIP-basic-oracle
MobileCLIP-S3
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains the MobileCLIP-S3 checkpoint; see the first MobileCLIP2 entry above for the full model description, benchmark table, and usage instructions.
coreml-sam2-tiny
aimv2-3B-patch14-336
coreml-sam2-small
coreml-stable-diffusion-1-4-palettized
aimv2-huge-patch14-336
SimpleSD-4B-thinking
aimv2-1B-patch14-336
coreml-FastViT-T8
AIM-7B
SimpleSD-30B-instruct
mobilevitv2-1.0-voc-deeplabv3
ane-distilbert-base-uncased-finetuned-sst-2-english
coreml-FastViT-MA36
DepthPro-mixin
mobilevitv2-1.5-voc-deeplabv3
AIM-600M
mobileclip2_coca_dfn2b_s13b_dci-complete_s12m_context77
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, and Hadi Pouransari. This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first MobileCLIP2 entry above for the full model description and benchmark table. To download the checkpoint, visit the `Files and versions` tab, or use `huggingface_hub` programmatically. For models with context lengths 128/256, copy `config.json` to `src/open_clip/model_configs/coca_ViT-L-14-context$len.json` and change the model name in your loading code to `coca_ViT-L-14-context$len`.
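The context-length instruction amounts to dropping this repository's `config.json` into `open_clip`'s model-config directory under a new name, so `open_clip` can resolve it as a named architecture. A shell sketch, assuming you run it from the root of an `open_clip` source checkout (the `echo` line merely fakes the downloaded file so the snippet is self-contained):

```shell
len=256
mkdir -p src/open_clip/model_configs             # already present in a real checkout
echo '{}' > config.json                          # stand-in for the downloaded config.json
cp config.json "src/open_clip/model_configs/coca_ViT-L-14-context${len}.json"
```

After this, loading the model by the name `coca_ViT-L-14-context256` should pick up the new config.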
mobileclip2_coca_dfn2b_s13b_dci-short_s12m_context77
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa entry above for the full description and usage instructions.
TiC-CLIP-bestpool-oracle
AIM-1B
coreml-sam2-baseplus
mobileclip2_coca_dfn2b_s13b_context77
mobileclip2_coca_dfn2b_s13b_dci-extended_s12m_context77
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa entry above for the full description and usage instructions.
mobileclip2_coca_dfn2b_s13b_gbc1m-short_context77
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa entry above for the full description and usage instructions.
mobileclip2_coca_dfn2b_s13b_mscoco38k_s12m_context77
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa entry above for the full description and usage instructions.
mobileclip2_coca_dfn2b_s13b_dci-extended_s12m_context256
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025; Featured). This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa entry above for the full description and usage instructions.
mobileclip2_coca_dfn2b_s13b_docci_s12m_context256
mobileclip2_coca_dfn2b_s13b_docci_s12m_context77
mobileclip2_coca_dfn2b_s13b_recap-coco-30k_s12m_context77