apple

139 models

mobilevit-small

---
license: other
tags:
- vision
- image-classification
datasets:
- imagenet-1k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

1,430,397
82

DFN5B-CLIP-ViT-H-14-378

---
license: apple-amlr
license_name: apple-sample-code-license
license_link: LICENSE
---
A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 5B images that were filtered from a pool of 43B uncurated image-text pairs (12.8B image-text pairs from CommonPool-12.8B + 30B additional public image-text pairs).

279,269
97

DFN5B-CLIP-ViT-H-14

145,977
45

MobileCLIP-S2-OpenCLIP

95,683
18

mobilevit-x-small

28,687
8

OpenELM-1_1B-Instruct

Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari

We introduce OpenELM, a family of Open Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. We pretrained OpenELM models using the CoreNet library. We release both pretrained and instruction-tuned models with 270M, 450M, 1.1B and 3B parameters. We release the complete framework, encompassing data preparation, training, fine-tuning, and evaluation procedures, alongside multiple pre-trained checkpoints and training logs, to facilitate open research. Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens. Please check the license agreements and terms of these datasets before using them.

We have provided an example function to generate output from OpenELM models loaded via the Hugging Face Hub in `generate_openelm.py`. You can try the model by running the following command: Please refer to this link to obtain your Hugging Face access token. Additional arguments to the Hugging Face generate function can be passed via `generate_kwargs`. As an example, to speed up inference, you can try lookup-token speculative generation by passing the `prompt_lookup_num_tokens` argument as follows: Alternatively, try model-wise speculative generation with an assistive model by passing a smaller model through the `assistant_model` argument, for example:

| Model Size | ARC-c | ARC-e | BoolQ | HellaSwag | PIQA | SciQ | WinoGrande | Average |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenELM-270M | 26.45 | 45.08 | 53.98 | 46.71 | 69.75 | 84.70 | 53.91 | 54.37 |
| OpenELM-270M-Instruct | 30.55 | 46.68 | 48.56 | 52.07 | 70.78 | 84.40 | 52.72 | 55.11 |
| OpenELM-450M | 27.56 | 48.06 | 55.78 | 53.97 | 72.31 | 87.20 | 58.01 | 57.56 |
| OpenELM-450M-Instruct | 30.38 | 50.00 | 60.37 | 59.34 | 72.63 | 88.00 | 58.96 | 59.95 |
| OpenELM-1_1B | 32.34 | 55.43 | 63.58 | 64.81 | 75.57 | 90.60 | 61.72 | 63.44 |
| OpenELM-1_1B-Instruct | 37.97 | 52.23 | 70.00 | 71.20 | 75.03 | 89.30 | 62.75 | 65.50 |
| OpenELM-3B | 35.58 | 59.89 | 67.40 | 72.44 | 78.24 | 92.70 | 65.51 | 67.39 |
| OpenELM-3B-Instruct | 39.42 | 61.74 | 68.17 | 76.36 | 79.00 | 92.50 | 66.85 | 69.15 |

| Model Size | ARC-c | HellaSwag | MMLU | TruthfulQA | WinoGrande | Average |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenELM-270M | 27.65 | 47.15 | 25.72 | 39.24 | 53.83 | 38.72 |
| OpenELM-270M-Instruct | 32.51 | 51.58 | 26.70 | 38.72 | 53.20 | 40.54 |
| OpenELM-450M | 30.20 | 53.86 | 26.01 | 40.18 | 57.22 | 41.50 |
| OpenELM-450M-Instruct | 33.53 | 59.31 | 25.41 | 40.48 | 58.33 | 43.41 |
| OpenELM-1_1B | 36.69 | 65.71 | 27.05 | 36.98 | 63.22 | 45.93 |
| OpenELM-1_1B-Instruct | 41.55 | 71.83 | 25.65 | 45.95 | 64.72 | 49.94 |
| OpenELM-3B | 42.24 | 73.28 | 26.76 | 34.98 | 67.25 | 48.90 |
| OpenELM-3B-Instruct | 47.70 | 76.87 | 24.80 | 38.76 | 67.96 | 51.22 |

| Model Size | ARC-c | CrowS-Pairs | HellaSwag | MMLU | PIQA | RACE | TruthfulQA | WinoGrande | Average |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenELM-270M | 27.65 | 66.79 | 47.15 | 25.72 | 69.75 | 30.91 | 39.24 | 53.83 | 45.13 |
| OpenELM-270M-Instruct | 32.51 | 66.01 | 51.58 | 26.70 | 70.78 | 33.78 | 38.72 | 53.20 | 46.66 |
| OpenELM-450M | 30.20 | 68.63 | 53.86 | 26.01 | 72.31 | 33.11 | 40.18 | 57.22 | 47.69 |
| OpenELM-450M-Instruct | 33.53 | 67.44 | 59.31 | 25.41 | 72.63 | 36.84 | 40.48 | 58.33 | 49.25 |
| OpenELM-1_1B | 36.69 | 71.74 | 65.71 | 27.05 | 75.57 | 36.46 | 36.98 | 63.22 | 51.68 |
| OpenELM-1_1B-Instruct | 41.55 | 71.02 | 71.83 | 25.65 | 75.03 | 39.43 | 45.95 | 64.72 | 54.40 |
| OpenELM-3B | 42.24 | 73.29 | 73.28 | 26.76 | 78.24 | 38.76 | 34.98 | 67.25 | 54.35 |
| OpenELM-3B-Instruct | 47.70 | 72.33 | 76.87 | 24.80 | 79.00 | 38.47 | 38.76 | 67.96 | 55.73 |

See the technical report for more results and comparison. The release of OpenELM models aims to empower and enrich the open research community by providing access to state-of-the-art language models. Trained on publicly available datasets, these models are made available without any safety guarantees. Consequently, there exists the possibility of these models producing outputs that are inaccurate, harmful, biased, or objectionable in response to user prompts. Thus, it is imperative for users and developers to undertake thorough safety testing and implement appropriate filtering mechanisms tailored to their specific requirements.

28,664
68

DepthPro

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.

Depth Pro was introduced in Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. The checkpoint in this repository is a reference implementation, which has been re-trained. Its performance is close to the model reported in the paper but does not match it exactly. Please follow the steps in the code repository to set up your environment. Then you can download the checkpoint from the Files and versions tab above, or use the `huggingface-hub` CLI: The code repo provides a helper script to run the model on a single image: Boundary metrics are implemented in `eval/boundary_metrics.py` and can be used as follows: If you find our work useful, please cite the following paper: Our codebase is built using multiple open-source contributions; please see Acknowledgements for more details. Please check the paper for a complete list of references and datasets used in this work.

25,102
486

DepthPro-hf

This is the transformers version of DepthPro, a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. For the checkpoint compatible with the original codebase, please check this repo.

DepthPro: Monocular Depth Estimation

Table of Contents
- Model Details
- Model Sources
- How to Get Started with the Model
- Training Details
- Training Data
- Preprocessing
- Training Hyperparameters
- Evaluation
- Model Architecture and Objective
- Citation
- Model Card Authors

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

> We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- Developed by: Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun
- Model type: DepthPro
- License: Apple-ASCL
- HF Docs: DepthPro
- Repository: https://github.com/apple/ml-depth-pro
- Paper: https://arxiv.org/abs/2410.02073

The DepthPro model was trained on the following datasets: Images go through the following preprocessing steps:
- rescaled by `1/255`
- normalized with `mean=[0.5, 0.5, 0.5]` and `std=[0.5, 0.5, 0.5]`
- resized to `1536x1536` pixels

The `DepthProForDepthEstimation` model uses a `DepthProEncoder` for encoding the input image and a `FeatureFusionStage` for fusing the output features from the encoder. The `DepthProEncoder` further uses two encoders:
- `patch_encoder`
  - The input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
  - Each scaled image is split into smaller patches of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
  - These patches are processed by the `patch_encoder`.
- `image_encoder`
  - The input image is also rescaled to `patch_size` and processed by the `image_encoder`.

Both of these encoders can be configured via `patch_model_config` and `image_model_config` respectively, both of which are a separate `Dinov2Model` by default. Outputs from both encoders (`last_hidden_state`) and selected intermediate states (`hidden_states`) from the `patch_encoder` are fused by a `DPT`-based `FeatureFusionStage` for depth estimation. The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field-of-view.
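The preprocessing steps listed in the card can be checked numerically. A minimal sketch of the per-channel rescale and normalization (my own helper, not part of transformers, assuming the standard `1/255` rescale factor):

```python
# Sketch of DepthPro-style pixel preprocessing: rescale a 0-255 channel
# value to [0, 1], then normalize with mean=0.5 and std=0.5, which maps
# the value into roughly [-1, 1].

def preprocess_pixel(v, mean=0.5, std=0.5):
    """Rescale one channel value to [0, 1], then normalize."""
    scaled = v / 255.0
    return (scaled - mean) / std

print(preprocess_pixel(0))    # → -1.0
print(preprocess_pixel(255))  # → 1.0
```

With `mean=std=0.5`, the transform is equivalent to `2*(v/255) - 1`, which is why the extremes land exactly at -1 and 1.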

25,064
71

DFN2B-CLIP-ViT-B-16

18,839
10

DFN2B-CLIP-ViT-L-14

13,817
16

FastVLM-0.5B

FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025).

Highlights

We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder. Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.

Evaluations

| Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:---|:---:|:---:|:---:|
| Ai2D | 68.0 | 77.4 | 83.6 |
| ScienceQA | 85.2 | 94.4 | 96.7 |
| MMMU | 33.9 | 37.8 | 45.4 |
| VQAv2 | 76.3 | 79.1 | 80.8 |
| ChartQA | 76.0 | 80.1 | 85.0 |
| TextVQA | 64.5 | 70.4 | 74.9 |
| InfoVQA | 46.4 | 59.7 | 75.8 |
| DocVQA | 82.5 | 88.3 | 93.2 |
| OCRBench | 63.9 | 70.2 | 73.1 |
| RealWorldQA | 56.1 | 61.2 | 67.2 |
| SeedBench-Img | 71.0 | 74.2 | 75.4 |

Usage Example

To run inference with the PyTorch checkpoint, follow the instructions in the official repo. Run inference using `predict.py` from the official repo.

Run inference with Transformers (Remote Code)

To run inference with transformers, we can leverage `trust_remote_code` along with the following snippet:

Citation

If you found this model useful, please cite the following paper:
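As a hedged sketch of the `trust_remote_code` loading path the card mentions: the repo id and the `AutoModelForCausalLM` head are assumptions based on this card, not a verified API, so treat this as a starting point rather than the official snippet.

```python
# Sketch: loading FastVLM via transformers with trust_remote_code.
# "apple/FastVLM-0.5B" and AutoModelForCausalLM are assumptions from the card.
def load_fastvlm(repo_id="apple/FastVLM-0.5B"):
    # Imported lazily so the sketch can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
    return tok, model
```

Because `trust_remote_code=True` executes Python shipped with the repository, review the remote modeling code before running this against an untrusted checkpoint.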

12,732
345

DFN2B-CLIP-ViT-L-14-39B

9,112
7

MobileCLIP-S1-OpenCLIP

6,105
10

mobilevitv2-1.0-imagenet1k-256

6,022
10

mobilevit-xx-small

MobileViT model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The license used is Apple sample code license. Disclaimer: The team releasing MobileViT did not write a model card for this model so this model card has been written by the Hugging Face team. MobileViT is a light-weight, low latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT-block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: Currently, both the feature extractor and model support PyTorch. The MobileViT model was pretrained on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes. Training requires only basic data augmentation, i.e. random resized cropping and horizontal flipping. To learn multi-scale representations without requiring fine-tuning, a multi-scale sampler was used during training, with image sizes randomly sampled from: (160, 160), (192, 192), (256, 256), (288, 288), (320, 320). At inference time, images are resized/rescaled to the same resolution (288x288), and center-cropped at 256x256. Pixels are normalized to the range [0, 1]. Images are expected to be in BGR pixel order, not RGB. 
The MobileViT networks are trained from scratch for 300 epochs on ImageNet-1k on 8 NVIDIA GPUs with an effective batch size of 1024 and learning rate warmup for 3k steps, followed by cosine annealing. Also used were label smoothing cross-entropy loss and L2 weight decay. Training resolution varies from 160x160 to 320x320, using multi-scale sampling.

| Model | ImageNet top-1 accuracy | ImageNet top-5 accuracy | # params | URL |
|:---|:---:|:---:|:---:|:---|
| MobileViT-XXS | 69.0 | 88.9 | 1.3 M | https://huggingface.co/apple/mobilevit-xx-small |
| MobileViT-XS | 74.8 | 92.3 | 2.3 M | https://huggingface.co/apple/mobilevit-x-small |
| MobileViT-S | 78.4 | 94.1 | 5.6 M | https://huggingface.co/apple/mobilevit-small |
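One practical gotcha from the preprocessing note above: these checkpoints expect BGR pixel order, so RGB inputs need their channels swapped first. A minimal illustration of the swap on a nested-list image (my own helper, not part of the model's feature extractor):

```python
# Sketch: swap RGB channel order to the BGR order MobileViT expects.
# Works on an image represented as rows of (R, G, B) tuples.

def rgb_to_bgr(image):
    """Reverse the channel order of every pixel in a row-major image."""
    return [[(b, g, r) for (r, g, b) in row] for row in image]

img = [[(255, 0, 0), (0, 255, 0)]]   # one red and one green pixel, RGB order
print(rgb_to_bgr(img))  # → [[(0, 0, 255), (0, 255, 0)]]
```

In practice the same swap is a one-liner on array data (e.g. reversing the last axis), but the effect is identical: red and blue trade places while green is unchanged.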

5,509
20

OpenELM-3B-Instruct

2,331
339

deeplabv3-mobilevit-small

2,306
16

MobileCLIP-B-LT-OpenCLIP

1,809
9

FastVLM-1.5B

FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025).

Highlights

We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder. Our larger variants using the Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.

Evaluations

| Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:---|:---:|:---:|:---:|
| Ai2D | 68.0 | 77.4 | 83.6 |
| ScienceQA | 85.2 | 94.4 | 96.7 |
| MMMU | 33.9 | 37.8 | 45.4 |
| VQAv2 | 76.3 | 79.1 | 80.8 |
| ChartQA | 76.0 | 80.1 | 85.0 |
| TextVQA | 64.5 | 70.4 | 74.9 |
| InfoVQA | 46.4 | 59.7 | 75.8 |
| DocVQA | 82.5 | 88.3 | 93.2 |
| OCRBench | 63.9 | 70.2 | 73.1 |
| RealWorldQA | 56.1 | 61.2 | 67.2 |
| SeedBench-Img | 71.0 | 74.2 | 75.4 |

Usage Example

To run inference with the PyTorch checkpoint, follow the instructions in the official repo. Run inference using `predict.py` from the official repo.

Run inference with Transformers (Remote Code)

To run inference with transformers, we can leverage `trust_remote_code` along with the following snippet:

Citation

If you found this model useful, please cite the following paper:

1,769
71

aimv2-large-patch14-224

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and scale effectively. Some AIMv2 highlights include: 1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks. 2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension. 3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk. Citation If you find our work useful, please consider citing us as:

1,672
58

FastVLM-7B

1,644
261

OpenELM-270M

1,189
75

DiffuCoder-7B-Instruct

949
54

DiffuCoder-7B-cpGRPO

The DiffuCoder-7B-cpGRPO variant further refines DiffuCoder-Instruct with reinforcement learning via Coupled-GRPO.
- Initialized from DiffuCoder-7B-Instruct, post-trained with coupled-GRPO on 21K code samples (1 epoch).
- Coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding.
- Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement: To power this HuggingFace model release, we reuse Dream's modeling architecture and generation utils.

933
312

aimv2-huge-patch14-448

926
5

OpenELM-3B

852
128

coreml-mobileclip

801
47

OpenELM-270M-Instruct

796
140

deeplabv3-mobilevit-xx-small

685
10

aimv2-large-patch14-224-lit

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and to scale effectively. Some AIMv2 highlights include: 1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks. 2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension. 3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk. Citation If you find our work useful, please consider citing us as:

671
6

OpenELM-450M

556
26

OpenELM-450M-Instruct

500
49

aimv2-large-patch14-native

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and scale effectively. Some AIMv2 highlights include: 1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks. 2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension. 3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk. Citation If you find our work useful, please consider citing us as:

353
12

OpenELM-1_1B

345
33

MobileCLIP-B-OpenCLIP

334
3

DiffuCoder-7B-Base

The DiffuCoder-7B-Base model is our foundational masked diffusion LLM for code generation.
- Training recipe: Using DiffuLLaMA's adaptation approach, trained on a large corpus of code: 65B tokens in Stage 1 and 65B tokens in Stage 2.
- Benchmarks: Strong baseline performance on HumanEval, MBPP and BigCodeBench.
- Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement: To power this HuggingFace model release, we reuse Dream's modeling architecture and generation utils.

325
25

FastVLM 0.5B Fp16

262
16

mobileclip_b_lt_timm

260
5

Coreml Stable Diffusion V1 5

This model was generated by Hugging Face using Apple's repository, which is released under the ASCL (Apple Sample Code License). Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. For more information about how Stable Diffusion functions, please have a look at 🤗's Stable Diffusion blog. The Stable-Diffusion-v1-5 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned for 595k steps at resolution 512x512 on "la...

228
64

coreml-stable-diffusion-2-1-base

223
51

aimv2-1B-patch14-224

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and scale effectively. Some AIMv2 highlights include: 1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks. 2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension. 3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk. Citation If you find our work useful, please consider citing us as:

203
7

mobileclip_s0_timm

178
10

mobileclip_s2_timm

155
5

coreml-depth-anything-v2-small

license:apache-2.0
146
75

coreml-YOLOv3

license:mit
131
15

aimv2-large-patch14-336

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and scale effectively. Some AIMv2 highlights include: 1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks. 2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension. 3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk. Citation If you find our work useful, please consider citing us as:

124
3

MobileCLIP2-S2

MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S2 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance as OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better avg zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:---|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:
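The card's own inference snippet was not captured in this page, but the CLIP-style zero-shot step it refers to reduces to cosine similarity between L2-normalized image and text embeddings followed by a softmax. A self-contained toy with made-up 3-d embeddings (my own illustration, not MobileCLIP2 features or its API):

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Made-up embeddings standing in for MobileCLIP2 image/text features.
image = normalize([0.9, 0.1, 0.0])
texts = [normalize([1.0, 0.0, 0.0]),   # e.g. "a photo of a dog"
         normalize([0.0, 1.0, 0.0])]   # e.g. "a photo of a cat"

# CLIP models scale cosine similarities by a learned temperature (~100 here).
logits = [100.0 * sum(i * t for i, t in zip(image, txt)) for txt in texts]
probs = softmax(logits)
print(probs[0] > probs[1])  # → True: the first caption matches best
```

The real model produces the embeddings with its image and text towers; everything after that is exactly this similarity-plus-softmax computation.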

107
7

coreml-detr-semantic-segmentation

license:apache-2.0
103
30

aimv2-large-patch14-448

101
7

TiC-CLIP-basic-sequential

101
1

coreml-stable-diffusion-v1-4

98
30

MobileCLIP2-S0

MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S0 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance as OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better avg zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:---|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:

95
37

coreml-stable-diffusion-v1-5-palettized

94
6

MobileCLIP-S1

90
11

MobileCLIP-S0

88
11

coreml-stable-diffusion-2-base

85
81

FastVLM-7B-int4

85
23

TiC-CLIP-bestpool-cumulative

85
3

TiC-CLIP-bestpool-sequential

81
0

MobileCLIP-S2

79
10

coreml-sam2.1-tiny

license:apache-2.0
79
9

deeplabv3-mobilevit-x-small

78
3

mobileclip_s1_timm

77
2

coreml-depth-anything-small

license:apache-2.0
76
37

coreml-resnet-50

license:mit
74
7

coreml-stable-diffusion-2-1-base-palettized

71
19

coreml-stable-diffusion-xl-base

69
68

MobileCLIP2-S3

MobileCLIP2: Improving Multi-Modal Reinforced Training

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S3 checkpoint.

`MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines). Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance as OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better avg zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:---|:---:|:---:|:---:|:---:|:---:|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

First, download the desired checkpoint by visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run: Then, install `ml-mobileclip` by following the instructions in the repo. It uses an API similar to `open_clip`'s. You can run inference with a code snippet like the following:

69
4

MobileCLIP2-B

69
0

FastVLM-1.5B-int8

67
15

coreml-sam2.1-small

license:apache-2.0
66
2

coreml-sam2.1-baseplus

license:apache-2.0
65
1

coreml-sam2.1-large

license:apache-2.0
63
23

TiC-CLIP-basic-cumulative

56
1

mobilevitv2-2.0-imagenet1k-256

56
0

mistral-coreml

license:apache-2.0
53
68

DFN-public

49
3

aimv2-large-patch14-224-distilled

49
0

MobileCLIP2-S4

MobileCLIP2: Improving Multi-Modal Reinforced Training. This repository contains the MobileCLIP2-S4 checkpoint; see the MobileCLIP2 model card above for the paper reference, the full results table, and download/usage instructions.

48
12

coreml-stable-diffusion-xl-base-with-refiner

47
12

mobileclip_b_timm

44
2

coreml-stable-diffusion-2-base-palettized

43
3

aimv2-huge-patch14-224

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple to implement, and the models train and scale effectively. Some AIMv2 highlights:

1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Citation: If you find our work useful, please consider citing us as:

41
12

MobileCLIP-B-LT

39
10

coreml-stable-diffusion-mixed-bit-palettization

37
26

MobileCLIP2-L-14

35
1

coreml-sam2-large

license:apache-2.0
34
28

MobileCLIP-L-14

MobileCLIP2: Improving Multi-Modal Reinforced Training. This repository contains the MobileCLIP-L-14 checkpoint; see the MobileCLIP2 model card above for the paper reference, the full results table, and download/usage instructions.

27
0

aimv2-large-patch14-336-distilled

26
5

MobileCLIP-S4

MobileCLIP2: Improving Multi-Modal Reinforced Training. This repository contains the MobileCLIP-S4 checkpoint; see the MobileCLIP2 model card above for the paper reference, the full results table, and download/usage instructions.

25
4

aimv2-3B-patch14-224

Part of the AIMv2 family of vision models; see the AIMv2 description above.

25
3

aimv2-1B-patch14-448

25
0

aimv2-3B-patch14-448

24
12

MobileCLIP-B

23
4

TiC-CLIP-basic-oracle

23
0

MobileCLIP-S3

MobileCLIP2: Improving Multi-Modal Reinforced Training. This repository contains the MobileCLIP-S3 checkpoint; see the MobileCLIP2 model card above for the paper reference, the full results table, and download/usage instructions.

22
1

coreml-sam2-tiny

license:apache-2.0
21
16

aimv2-3B-patch14-336

Part of the AIMv2 family of vision models; see the AIMv2 description above.

20
4

coreml-sam2-small

license:apache-2.0
19
5

coreml-stable-diffusion-1-4-palettized

18
7

aimv2-huge-patch14-336

Part of the AIMv2 family of vision models; see the AIMv2 description above.

18
0

SimpleSD-4B-thinking

17
0

aimv2-1B-patch14-336

Part of the AIMv2 family of vision models; see the AIMv2 description above.

17
0

coreml-FastViT-T8

16
14

AIM-7B

14
24

SimpleSD-30B-instruct

14
1

mobilevitv2-1.0-voc-deeplabv3

11
2

ane-distilbert-base-uncased-finetuned-sst-2-english

license:apache-2.0
10
17

coreml-FastViT-MA36

10
10

DepthPro-mixin

10
7

mobilevitv2-1.5-voc-deeplabv3

10
0

AIM-600M

8
18

mobileclip2_coca_dfn2b_s13b_dci-complete_s12m_context77

MobileCLIP2: Improving Multi-Modal Reinforced Training. This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the MobileCLIP2 model card above for the paper reference, the full results table, and download instructions. For models with context lengths 128/256, copy `config.json` to `src/open_clip/model_configs/coca_ViT-L-14-context$len.json` and change the model name in the example below to `coca_ViT-L-14-context$len`.
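The config-registration step above can be sketched as follows. The directory layout (`src/open_clip/model_configs/`) follows the open_clip convention this repo builds on; here a scratch directory and a placeholder JSON stand in for a real checkout and the checkpoint's `config.json`:

```shell
# Register a longer-context variant of the CoCa ViT-L/14 model config.
len=256
mkdir -p scratch/src/open_clip/model_configs
# Placeholder for the config.json shipped with this checkpoint.
echo '{"embed_dim": 768}' > scratch/config.json
cp scratch/config.json "scratch/src/open_clip/model_configs/coca_ViT-L-14-context${len}.json"
ls scratch/src/open_clip/model_configs/
```

After this, loading the model under the name `coca_ViT-L-14-context256` picks up the new config.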

8
0

mobileclip2_coca_dfn2b_s13b_dci-short_s12m_context77

This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa model card above for details, including the context-length configuration note.

8
0

TiC-CLIP-bestpool-oracle

7
0

AIM-1B

6
5

coreml-sam2-baseplus

license:apache-2.0
6
1

mobileclip2_coca_dfn2b_s13b_context77

6
0

mobileclip2_coca_dfn2b_s13b_dci-extended_s12m_context77

This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa model card above for details, including the context-length configuration note.

6
0

mobileclip2_coca_dfn2b_s13b_gbc1m-short_context77

This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa model card above for details, including the context-length configuration note.

6
0

mobileclip2_coca_dfn2b_s13b_mscoco38k_s12m_context77

This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa model card above for details, including the context-length configuration note.

6
0

mobileclip2_coca_dfn2b_s13b_dci-extended_s12m_context256

This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on varying datasets; see the first CoCa model card above for details, including the context-length configuration note.

5
0

mobileclip2_coca_dfn2b_s13b_docci_s12m_context256


5
0

mobileclip2_coca_dfn2b_s13b_docci_s12m_context77


5
0

mobileclip2_coca_dfn2b_s13b_recap-coco-30k_s12m_context77


5
0

sage-ft-mixtral-8x7b

license:apache-2.0
4
24

SimpleSD-4B-instruct

4
0

mobileclip2_coca_dfn2b_s13b_dci-complete_s12m_context256

4
0

DCLM-7B-8k

2
42

AIM-3B

2
3

OpenELM

0
1,442

starflow

0
275

CLaRa-7B-Instruct

0
156

Sharp

0
137

AIM

0
88

coreml-stable-diffusion-xl-base-ios

0
34

CLaRa-7B-E2E

0
13

CLaRa-7B-Base

0
10

ml-reversal-blessing

0
5