timm

500 models

mobilenetv3_small_100.lamb_in1k

A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- A LAMB optimizer based recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.5
  - GMACs: 0.1
  - Activations (M): 1.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.
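The card above follows the standard `timm` classification layout; the following is a rough usage sketch, not part of the card. The image path is a placeholder.

```python
# Minimal inference sketch (assumption: a local "example.jpg" placeholder image).
import timm
import torch
from PIL import Image

model = timm.create_model('mobilenetv3_small_100.lamb_in1k', pretrained=True)
model = model.eval()

# Build the eval-time preprocessing (224 x 224 here) from the pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)
probs, classes = torch.topk(logits.softmax(dim=1), k=5)
print(probs, classes)
```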

23,426,899
41

resnet50.a1_in1k

This model features: ReLU activations single layer 7x7 convolution with pooling 1x1 convolution shortcut downsample Trained on ImageNet-1k in `timm` using recipe template described below. Recipe details: ResNet Strikes Back `A1` recipe LAMB optimizer with BCE loss Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 25.6 - GMACs: 4.1 - Activations (M): 11.1 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476 - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385 - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |imgsize|top1 |top5 |paramcount|gmacs|macts|img/sec| |------------------------------------------|--------|-----|-----|-----------|-----|-----|-------| |seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 | |seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 | |seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 | |seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 | |resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 | |resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 | |ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 | |ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 | |resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 | |seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 | |resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 | |resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 | |seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 | |seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 | |resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 | |resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 | |resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 | |seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 | |resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 | |resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 | |resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 | |resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 | |seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 | |resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 | |seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 | |seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 | |resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 | |resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 | |seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 | |resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 | |resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 | |resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 | |resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 | |resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 | |resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 | |resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 | |resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 | |ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 | 
|resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 | |resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 | |resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 | |resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 | |resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 | |resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 | |resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 | |resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 | |resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 | |wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 | |resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 | |resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 | |ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 | |resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 | |resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 | |resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 | |wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 | |resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 | |resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 | |seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 | |ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 | |resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 | |ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 | |resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 | |ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 | |resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 | |resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 | |resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 | |resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 | |resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 | |resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 | |resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 | |resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 | |ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 | |ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 | |resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 | |wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 | |resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 | |resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 | |resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 | |gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 | |wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 | |resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 | |ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 | |ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 | |resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 | |resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 | |seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 | |resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 | |seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 | |ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 | |gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 | |senet154.gluonin1k|224 |81.23|95.35|115.1 |20.8 |38.7 |545 | |resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 
|18.4 |2089 | |resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 | |resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 | |resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 | |resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 | |resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 | |seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 | |seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 | |resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 | |resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 | |gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 | |resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 | |resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 | |seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 | |seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 | |resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 | |resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 | |ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 | |ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 | |seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 | |ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 | |resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 | |resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 | |resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 | |resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 | |resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 | |ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 | |resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 | |resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 | |resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 | |ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 | |resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 | |gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 | |gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 | |resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 | |resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 | |resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 | |wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 | |resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 | |resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 | |ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 | |resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 | |wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 | |resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 | |resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 | |seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 | |resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 | |resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 | |resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 | |seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 | |resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 | |resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 | |resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 | |seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 | |ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 | |gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 |11.7 |3216 | |resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 | 
|seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 | |resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 | |resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 | |seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 | |resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 | |resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 | |resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 | |resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 | |ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 | |resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 | |resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 | |resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 | |resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 | |resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 | |ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 | |resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 | |resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 | |resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 | |resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 | |ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 | |resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 | |resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 | |resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 | |resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 | |resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 | |resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 | |resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 | |resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 | |resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 | |resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 | |resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 | |resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 | |resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 | |resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 | |resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 | |resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 | |resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 | |resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 | |ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 | |resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 | |wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 | |seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 | |resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 | |seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 | |resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 | |resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 | |wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 | |resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 | |resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 | |gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 | |resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 | |resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 | |seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 | |batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 | |resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 | |resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 | |ecaresnext26ts.chin1k|288 |78.0 |93.91|10.3 |3.1 |13.3 |3297 | |seresnext26t32x4d.btin1k|224 
|77.98|93.75|16.8 |2.7 |10.1 |3841 | |resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 | |resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 | |resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 | |seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 | |resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 | |gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 | |ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 | |resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 | |resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 | |seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 | |resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 | |ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 | |resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 | |resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 | |resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 | |resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 | |resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 | |resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 | |seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 | |resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 | |resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 | |resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 | |resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 | |resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 | |resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 | |resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 | |resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 | |resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 | |resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 | |seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 | |resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 | |resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 | |resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 | |resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 | |resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 | |resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 | |resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 | |resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 | |resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 | |resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 | |resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 | |resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 | |resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 | |resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 | |resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 | |resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 | |resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 | |resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 | |resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 | |resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
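Since the card lists resnet50.a1_in1k as an image classification / feature backbone, here is a hedged sketch of the feature-backbone path in `timm`; the dummy input and print format are illustrative, not from the card.

```python
# Sketch: use the checkpoint as a feature backbone instead of a classifier.
# features_only=True returns a list of intermediate feature maps.
import timm
import torch

backbone = timm.create_model('resnet50.a1_in1k', pretrained=True, features_only=True)
backbone = backbone.eval()

x = torch.randn(1, 3, 224, 224)  # dummy batch at the 224 x 224 train size
with torch.no_grad():
    feature_maps = backbone(x)

# Each output map comes with channel-count and reduction (stride) metadata.
for fmap, ch, red in zip(feature_maps,
                         backbone.feature_info.channels(),
                         backbone.feature_info.reduction()):
    print(fmap.shape, 'channels:', ch, 'stride:', red)
```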

3,552,276
39

resnet18.a1_in1k

--- tags: - image-classification - timm - transformers license: apache-2.0 library_name: timm ---

license:apache-2.0
2,892,108
12

convnextv2_nano.fcmae_ft_in22k_in1k

--- license: cc-by-nc-4.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-22k ---

license:cc-by-nc-4.0
2,701,808
3

efficientnet_b0.ra_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k ---

license:apache-2.0
1,060,685
4

ViT-B-16-SigLIP2-512

--- tags: - siglip - siglip2 - vision library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---

license:apache-2.0
803,163
1

vit_small_patch16_224.augreg_in21k_ft_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---

license:apache-2.0
783,000
4

resnet34.a1_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---

license:apache-2.0
780,024
1

vit_base_patch16_plus_clip_240.laion400m_e31

--- tags: - clip library_name: open_clip pipeline_tag: zero-shot-image-classification license: mit ---

license:mit
632,187
1

ViT-B-16-SigLIP-i18n-256

--- tags: - clip - siglip library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---

license:apache-2.0
628,283
4

vit_tiny_patch16_224.augreg_in21k_ft_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---

license:apache-2.0
504,587
3

resnet50.ram_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---

license:apache-2.0
497,414
0

vit_base_patch16_224.augreg2_in21k_ft_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---

license:apache-2.0
491,918
12

convnext_tiny.in12k_ft_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-12k ---

license:apache-2.0
374,075
3

convnext_femto.d1_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---

license:apache-2.0
363,398
0

vit_base_patch16_224.dino

--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---

license:apache-2.0
345,847
6

wide_resnet50_2.racm_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---

license:apache-2.0
336,431
2

convnext_large.fb_in22k_ft_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-22k ---

license:apache-2.0
324,046
2

vgg19.tv_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: bsd-3-clause datasets: - imagenet-1k ---

license:bsd-3-clause
321,948
6

mobilenetv3_large_100.ra_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k ---

license:apache-2.0
304,045
33

rexnet_150.nav_in1k

--- license: mit library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---

license:mit
298,743
0

vit_base_patch14_dinov2.lvd142m

--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---

license:apache-2.0
241,957
8

regnety_032.ra_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---

license:apache-2.0
228,519
0

samvit_base_patch16.sa1b

--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---

license:apache-2.0
211,539
1

ViT-SO400M-16-SigLIP2-384

--- tags: - siglip - siglip2 - vision library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---

license:apache-2.0
204,598
5

resnet50.fb_swsl_ig1b_ft_in1k

--- license: cc-by-nc-4.0 library_name: timm tags: - image-classification - timm - transformers ---

license:cc-by-nc-4.0
197,978
0

vit_small_patch14_reg4_dinov2.lvd142m

--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---

license:apache-2.0
191,884
6

deit_small_patch16_224.fb_in1k

--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---

license:apache-2.0
188,191
0

vit_base_patch16_224.augreg_in21k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-21k ---

license:apache-2.0
181,839
9

vit_base_patch32_384.augreg_in21k_ft_in1k

--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---

license:apache-2.0
176,437
0

edgenext_small.usi_in1k

An EdgeNeXt image classification model. Trained on ImageNet-1k by paper authors using distillation (`USI` as per `Solving ImageNet`).

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.6
  - GMACs: 1.3
  - Activations (M): 9.1
  - Image size: train = 256 x 256, test = 320 x 320
- Papers:
  - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications: https://arxiv.org/abs/2206.10589
  - Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results: https://arxiv.org/abs/2204.03475
- Dataset: ImageNet-1k
- Original: https://github.com/mmaaz60/EdgeNeXt

license:mit
165,421
6

convnext_base.clip_laion2b_augreg_ft_in12k_in1k

license:apache-2.0
157,099
0

vit_large_patch14_dinov2.lvd142m

A Vision Transformer (ViT) image feature model. Pretrained on LVD-142M with the self-supervised DINOv2 method.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.4
  - GMACs: 507.1
  - Activations (M): 1058.8
  - Image size: 518 x 518
- Papers:
  - DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Original: https://github.com/facebookresearch/dinov2
- Pretrain Dataset: LVD-142M

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.
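Because this checkpoint is a feature model rather than a trained classifier, a sketch of pulling image embeddings out of it with `timm` (assumed usage, dummy input):

```python
# Sketch: extract image embeddings; num_classes=0 removes the classifier head
# so the forward pass returns pooled features.
import timm
import torch

model = timm.create_model('vit_large_patch14_dinov2.lvd142m', pretrained=True, num_classes=0)
model = model.eval()

x = torch.randn(1, 3, 518, 518)  # this checkpoint expects 518 x 518 inputs
with torch.no_grad():
    pooled = model(x)                   # (1, 1024) pooled image embedding
    tokens = model.forward_features(x)  # unpooled token features
print(pooled.shape, tokens.shape)
```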

license:apache-2.0
155,170
15

deit_tiny_patch16_224.fb_in1k

license:apache-2.0
146,682
0

ViT-B-16-SigLIP-256

license:apache-2.0
135,686
2

convnext_base.fb_in22k_ft_in1k

license:apache-2.0
132,882
3

vit_base_patch16_dinov3.lvd1689m

132,600
3

convnextv2_tiny.fcmae_ft_in1k

license:cc-by-nc-4.0
131,819
0

vit_base_patch16_clip_224.openai

Model Details The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. This instance of the CLIP model is intended for loading in `timm` (https://github.com/rwightman/pytorch-image-models) and `OpenCLIP` (https://github.com/mlfoundations/openclip) libraries. Please see https://huggingface.co/openai/clip-vit-base-patch16 for use in Hugging Face Transformers. Model Type The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. Intended Use The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Primary intended uses The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Out-of-Scope Use Cases Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Data The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. 
Data Mission Statement Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. Limitations CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. Bias and Fairness We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks.
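Since the card states this instance is intended for loading in `timm` and `OpenCLIP`, here is a hedged zero-shot classification sketch via OpenCLIP. The `hf-hub:` repo id is assumed to match the model name above; the labels and image path are placeholders.

```python
# Zero-shot classification sketch with OpenCLIP; repo id, labels, and image
# path are assumptions for illustration.
import torch
from PIL import Image
import open_clip

repo = 'hf-hub:timm/vit_base_patch16_clip_224.openai'  # assumed repo id
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)
model = model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)   # placeholder image
text = tokenizer(['a photo of a cat', 'a photo of a dog'])   # placeholder labels

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # one probability per candidate label
```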

license:apache-2.0
128,695
10

convnext_nano.in12k_ft_in1k

license:apache-2.0
128,528
0

mobilenetv3_small_075.lamb_in1k

A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- A LAMB optimizer recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.0
  - GMACs: 0.0
  - Activations (M): 1.3
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
124,949
1

tf_mobilenetv3_large_minimal_100.in1k

A MobileNet-v3 image classification model. Trained on ImageNet-1k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 3.9
  - GMACs: 0.2
  - Activations (M): 4.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
121,722
0

vit_large_patch14_reg4_dinov2.lvd142m

A Vision Transformer (ViT) image feature model with registers. Pretrained on LVD-142M with the self-supervised DINOv2 method.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.4
  - GMACs: 416.1
  - Activations (M): 305.3
  - Image size: 518 x 518
- Papers:
  - Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588
  - DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Original: https://github.com/facebookresearch/dinov2
- Pretrain Dataset: LVD-142M

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
118,350
7

inception_v4.tf_in1k

license:apache-2.0
114,078
3

inception_v3.tv_in1k

An Inception-v3 image classification model. Trained on ImageNet-1k, torchvision weights.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 23.8
  - GMACs: 5.7
  - Activations (M): 9.0
  - Image size: 299 x 299
- Papers:
  - Rethinking the Inception Architecture for Computer Vision: https://arxiv.org/abs/1512.00567
- Original: https://github.com/pytorch/vision
- Dataset: ImageNet-1k

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.
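A small sketch of how the 299 x 299 input size above propagates into preprocessing via `timm`'s data-config helpers; the values noted in comments are what the card implies, not verified here.

```python
# Sketch: read the preprocessing settings stored in the pretrained config.
import timm

model = timm.create_model('inception_v3.tv_in1k', pretrained=True).eval()

cfg = timm.data.resolve_model_data_config(model)
print(cfg['input_size'])        # expected (3, 299, 299) per the card
print(cfg['mean'], cfg['std'])  # normalization constants from the config

# Turn the config into an eval transform (resize / center-crop / normalize).
transform = timm.data.create_transform(**cfg, is_training=False)
print(transform)
```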

license:apache-2.0
111,610
1

beitv2_base_patch16_224.in1k_ft_in22k

license:apache-2.0
110,992
1

tf_efficientnetv2_s.in21k_ft_in1k

license:apache-2.0
107,954
1

mobilenetv3_large_100.miil_in21k_ft_in1k

A MobileNet-v3 image classification model. Pretrained on ImageNet-21k-P and fine-tuned on ImageNet-1k by Alibaba MIIL.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.5
  - GMACs: 0.2
  - Activations (M): 4.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k-P

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
102,835
1

tf_mobilenetv3_small_minimal_100.in1k

license:apache-2.0
99,726
0

vit_base_patch8_224.augreg2_in21k_ft_in1k

license:apache-2.0
98,060
3

vit_small_patch14_dinov2.lvd142m

license:apache-2.0
92,999
6

eva02_large_patch14_448.mim_m38m_ft_in22k_in1k

license:mit
92,753
21

swin_base_patch4_window7_224.ms_in22k_ft_in1k

license:mit
91,048
7

vit_small_patch16_dinov3.lvd1689m

87,550
1

swin_tiny_patch4_window7_224.ms_in1k

license:mit
85,855
0

vit_small_patch16_224.dino

license:apache-2.0
82,779
4

tf_efficientnetv2_m.in21k_ft_in1k

An EfficientNet-v2 image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 54.1
  - GMACs: 15.9
  - Activations (M): 57.5
  - Image size: train = 384 x 384, test = 480 x 480
- Papers:
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
81,548
4

vit_so400m_patch16_siglip_256.v2_webli

license:apache-2.0
80,910
0

resnet101.tv_in1k

license:bsd-3-clause
80,694
0

resnet18.fb_swsl_ig1b_ft_in1k

license:cc-by-nc-4.0
75,912
0

inception_v3.tf_adv_in1k

An Inception-v3 image classification model. Adversarially trained on ImageNet-1k by paper authors. Ported from Tensorflow by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 23.8
  - GMACs: 5.7
  - Activations (M): 9.0
  - Image size: 299 x 299
- Papers:
  - Rethinking the Inception Architecture for Computer Vision: https://arxiv.org/abs/1512.00567
  - Adversarial Attacks and Defences Competition: https://arxiv.org/abs/1804.00097
- Original: https://github.com/tensorflow/models
- Dataset: ImageNet-1k

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
72,620
1

vit_large_patch14_clip_336.openai

Model Details The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. This instance of the CLIP model is intended for loading in `timm` (https://github.com/rwightman/pytorch-image-models) and `OpenCLIP` (https://github.com/mlfoundations/openclip) libraries. Please see https://huggingface.co/openai/clip-vit-large-patch14-336 for use in Hugging Face Transformers. Model Type The model uses a ViT-L/14 (336x336) Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. Intended Use The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Primary intended uses The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Out-of-Scope Use Cases Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Data The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. 
Data Mission Statement Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. Limitations CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. Bias and Fairness We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks.

license:apache-2.0
71,545
2

vit_base_patch16_clip_224.laion400m_e32

license:mit
70,978
0

mobilenetv3_small_050.lamb_in1k

license:apache-2.0
69,232
0

convnext_small.dinov3_lvd1689m

68,918
1

tf_efficientnetv2_s.in21k

An EfficientNet-v2 image classification model. Trained on ImageNet-21k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 48.2
  - GMACs: 5.4
  - Activations (M): 22.8
  - Image size: train = 300 x 300, test = 384 x 384
- Papers:
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
- Dataset: ImageNet-21k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
68,879
5

ViT-B-16-SigLIP2-256

A SigLIP 2 Vision-Language model trained on WebLI. This model has been converted for use in OpenCLIP from the original JAX checkpoints in Big Vision.

Model Details
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Original: https://github.com/google-research/big_vision
- Dataset: WebLI
- Papers:
  - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
  - Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343

license:apache-2.0
64,776
6

vit_large_patch14_clip_224.metaclip_2pt5b

Model card for vit_large_patch14_clip_224.metaclip_2pt5b

Model Usage

This model is a dual-use `open_clip` and `timm` model. The model name in OpenCLIP is `ViT-L-14-quickgelu`, and the timm name is `vit_large_patch14_clip_224.metaclip_2pt5b`.
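A hedged sketch of the dual-use claim above, assuming the weights live in a Hugging Face repo named after the timm model (`timm/vit_large_patch14_clip_224.metaclip_2pt5b`):

```python
# Sketch: the same checkpoint loaded two ways. The repo id is an assumption.
import open_clip
import timm

repo = 'hf-hub:timm/vit_large_patch14_clip_224.metaclip_2pt5b'

# OpenCLIP: full image + text contrastive model.
clip_model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)

# timm: image tower only, as a feature extractor (num_classes=0 drops the head).
image_encoder = timm.create_model(
    'vit_large_patch14_clip_224.metaclip_2pt5b',
    pretrained=True,
    num_classes=0,
).eval()
```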

license:cc-by-nc-4.0
63,722
0

vit_large_patch16_dinov3.lvd1689m

63,260
0

vit_tiny_patch16_224.augreg_in21k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 9.7
  - GMACs: 1.1
  - Activations (M): 4.1
  - Image size: 224 x 224
- Papers:
  - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Dataset: ImageNet-21k
- Original: https://github.com/google-research/vision_transformer

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
61,437
1

swin_base_patch4_window12_384.ms_in22k_ft_in1k

license:mit
60,948
0

ViT-SO400M-14-SigLIP

license:apache-2.0
60,637
18

mobilenetv2_100.ra_in1k

A MobileNet-v2 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- RandAugment `RA` recipe. Inspired by and evolved from EfficientNet RandAugment recipes. Published as `B` recipe in ResNet Strikes Back.
- RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 3.5
  - GMACs: 0.3
  - Activations (M): 6.7
  - Image size: 224 x 224
- Papers:
  - MobileNetV2: Inverted Residuals and Linear Bottlenecks: https://arxiv.org/abs/1801.04381
  - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
59,468
4

efficientnet_b3.ra2_in1k

An EfficientNet image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- RandAugment `RA2` recipe. Inspired by and evolved from EfficientNet RandAugment recipes. Published as `B` recipe in ResNet Strikes Back.
- RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 12.2
  - GMACs: 1.6
  - Activations (M): 21.5
  - Image size: train = 288 x 288, test = 320 x 320
- Papers:
  - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946
  - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.
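The card lists different train (288) and test (320) resolutions; a sketch of resolving the test-time config in `timm`, assuming the `use_test_size` flag of `resolve_model_data_config` behaves as described in the comment:

```python
# Sketch: build eval transforms at the larger test resolution advertised above.
import timm

model = timm.create_model('efficientnet_b3.ra2_in1k', pretrained=True).eval()

# use_test_size=True is assumed to pick the 320 x 320 test resolution
# instead of the 288 x 288 train resolution.
cfg = timm.data.resolve_model_data_config(model, use_test_size=True)
transform = timm.data.create_transform(**cfg, is_training=False)
print(cfg['input_size'], transform)
```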

license:apache-2.0
58,330
5

deit_base_distilled_patch16_384.fb_in1k

license:apache-2.0
57,787
1

twins_svt_large.in1k

A Twins-SVT image classification model. Trained on ImageNet-1k by paper authors.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 99.3
  - GMACs: 15.1
  - Activations (M): 35.1
  - Image size: 224 x 224
- Papers:
  - Twins: Revisiting the Design of Spatial Attention in Vision Transformers: https://arxiv.org/abs/2104.13840
- Dataset: ImageNet-1k
- Original: https://github.com/Meituan-AutoML/Twins

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
55,347
0

vit_base_patch32_clip_224.laion400m_e32

license:mit
54,458
0

densenet121.ra_in1k

license:apache-2.0
54,415
2

vit_small_plus_patch16_dinov3.lvd1689m

53,941
1

tf_efficientnet_b0.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.3
  - GMACs: 0.4
  - Activations (M): 6.7
  - Image size: 224 x 224
- Papers:
  - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946
  - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252
- Dataset: ImageNet-1k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
53,708
3

vit_large_patch16_384.augreg_in21k_ft_in1k

Model card for vit_large_patch16_384.augreg_in21k_ft_in1k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.7
  - GMACs: 174.8
  - Activations (M): 128.2
  - Image size: 384 x 384
- Papers:
  - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k
- Original: https://github.com/google-research/vision_transformer

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
52,388
0

convnext_xxlarge.clip_laion2b_soup_ft_in1k

license:apache-2.0
52,076
3

inception_resnet_v2.tf_in1k

license:apache-2.0
51,291
0

ViT-SO400M-14-SigLIP-384

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI. This model has been converted to PyTorch from the original JAX checkpoints in Big Vision. These weights are usable in both OpenCLIP (image + text) and timm (image only).

Model Details
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Original: https://github.com/google-research/big_vision
- Dataset: WebLI
- Papers:
  - Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343
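Since the weights are stated to work in both OpenCLIP and timm, a hedged OpenCLIP sketch follows; note that SigLIP scores labels with a sigmoid (independent probabilities) rather than a softmax. The repo id, labels, and image path are assumptions.

```python
# SigLIP scoring sketch with OpenCLIP; each label gets an independent
# probability via sigmoid(logit_scale * similarity + logit_bias).
import torch
from PIL import Image
import open_clip

repo = 'hf-hub:timm/ViT-SO400M-14-SigLIP-384'  # assumed repo id
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)
model = model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)   # placeholder image
text = tokenizer(['a photo of a cat', 'a photo of a dog'])   # placeholder labels

with torch.no_grad():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    logits = image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias
    probs = torch.sigmoid(logits)
print(probs)
```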

license:apache-2.0
49,584
81

resnet101.a1h_in1k

This model features: ReLU activations single layer 7x7 convolution with pooling 1x1 convolution shortcut downsample Trained on ImageNet-1k in `timm` using recipe template described below. Recipe details: Based on ResNet Strikes Back `A1` recipe LAMB optimizer Stronger dropout, stochastic depth, and RandAugment than paper `A1` recipe Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 44.5 - GMACs: 7.8 - Activations (M): 16.2 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476 - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385 - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |imgsize|top1 |top5 |paramcount|gmacs|macts|img/sec| |------------------------------------------|--------|-----|-----|-----------|-----|-----|-------| |seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 | |seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 | |seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 | |seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 | |resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 | |resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 | |ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 | |ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 | |resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 | |seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 | |resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 | |resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 | |seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 | |seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 | |resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 | |resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 | |resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 | |seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 | |resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 | |resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 | |resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 | |resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 | |seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 | |resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 | |seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 | |seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 | |resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 | |resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 | |seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 | |resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 | |resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 | |resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 | |resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 | |resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 | |resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 | |resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 | |resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 | 
|ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 | |resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 | |resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 | |resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 | |resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 | |resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 | |resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 | |resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 | |resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 | |resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 | |wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 | |resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 | |resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 | |ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 | |resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 | |resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 | |resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 | |wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 | |resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 | |resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 | |seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 | |ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 | |resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 | |ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 | |resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 | |ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 | |resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 | |resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 | |resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 | |resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 | |resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 | |resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 | |resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 | |resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 | |ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 | |ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 | |resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 | |wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 | |resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 | |resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 | |resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 | |gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 | |wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 | |resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 | |ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 | |ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 | |resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 | |resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 | |seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 | |resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 | |seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 | |ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 | |gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 | |senet154.gluonin1k|224 
|81.23|95.35|115.1 |20.8 |38.7 |545 | |resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 |18.4 |2089 | |resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 | |resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 | |resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 | |resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 | |resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 | |seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 | |seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 | |resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 | |resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 | |gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 | |resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 | |resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 | |seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 | |seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 | |resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 | |resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 | |ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 | |ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 | |seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 | |ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 | |resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 | |resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 | |resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 | |resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 | |resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 | |ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 | |resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 | |resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 | |resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 | |ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 | |resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 | |gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 | |gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 | |resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 | |resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 | |resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 | |wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 | |resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 | |resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 | |ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 | |resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 | |wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 | |resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 | |resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 | |seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 | |resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 | |resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 | |resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 | |seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 | |resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 | |resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 | |resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 | |seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 | |ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 | |gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 
|11.7 |3216 | |resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 | |seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 | |resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 | |resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 | |seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 | |resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 | |resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 | |resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 | |resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 | |ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 | |resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 | |resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 | |resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 | |resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 | |resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 | |ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 | |resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 | |resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 | |resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 | |resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 | |ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 | |resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 | |resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 | |resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 | |resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 | |resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 | |resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 | |resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 | |resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 | |resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 | |resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 | |resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 | |resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 | |resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 | |resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 | |resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 | |resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 | |resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 | |resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 | |ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 | |resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 | |wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 | |seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 | |resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 | |seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 | |resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 | |resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 | |wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 | |resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 | |resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 | |gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 | |resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 | |resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 | |seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 | |batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 | |resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 | |resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 | |ecaresnext26ts.chin1k|288 
|78.0 |93.91|10.3 |3.1 |13.3 |3297 | |seresnext26t32x4d.btin1k|224 |77.98|93.75|16.8 |2.7 |10.1 |3841 | |resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 | |resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 | |resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 | |seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 | |resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 | |gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 | |ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 | |resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 | |resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 | |seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 | |resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 | |ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 | |resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 | |resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 | |resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 | |resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 | |resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 | |resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 | |seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 | |resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 | |resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 | |resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 | |resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 | |resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 | |resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 | |resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 | |resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 | |resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 | |resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 | |seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 | |resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 | |resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 | |resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 | |resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 | |resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 | |resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 | |resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 | |resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 | |resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 | |resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 | |resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 | |resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 | |resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 | |resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 | |resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 | |resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 | |resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 | |resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 | |resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 | |resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
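The numbers above are easiest to reproduce by loading a checkpoint through `timm` and evaluating at its larger test resolution. The sketch below is an illustrative addition (not part of the original card), assumes `timm` >= 0.9 with network access to download pretrained weights, and uses `resnet101.a1h_in1k` from the table as the example checkpoint.

```python
import torch
import timm

# Load an example checkpoint from the table and inspect its train / test sizes.
model = timm.create_model('resnet101.a1h_in1k', pretrained=True)
model.eval()

cfg = model.pretrained_cfg
print(cfg['input_size'])                               # train size, e.g. (3, 224, 224)
print(cfg.get('test_input_size', cfg['input_size']))   # larger eval size, if defined

# ResNets are fully convolutional, so a larger eval resolution works directly.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 288, 288))
print(logits.shape)  # torch.Size([1, 1000])
```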

license:apache-2.0
48,257
0

deit_base_distilled_patch16_224.fb_in1k

license:apache-2.0
48,093
0

vit_large_patch16_224.augreg_in21k_ft_in1k

license:apache-2.0
47,475
1

convnext_tiny.dinov3_lvd1689m

41,958
0

swin_base_patch4_window12_384.ms_in22k

license:mit
41,281
0

resnet50d.ra2_in1k

license:apache-2.0
40,556
0

efficientnet_b4.ra2_in1k

license:apache-2.0
40,481
0

convnext_tiny.in12k

A ConvNeXt image classification model. Trained in `timm` on ImageNet-12k (a 11821 class subset of full ImageNet-22k) by Ross Wightman. ImageNet-12k training done on TPUs thanks to support of the TRC program. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 36.9 - GMACs: 4.5 - Activations (M): 13.4 - Image size: 224 x 224 - Papers: - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - Original: https://github.com/huggingface/pytorch-image-models - Dataset: ImageNet-12k Model Comparison Explore the dataset and runtime metrics of this model in timm model results. All timing numbers from eager model PyTorch 1.13 on RTX 3090 w/ AMP. | model |top1 |top5 |imgsize|paramcount|gmacs |macts |samplespersec|batchsize| |------------------------------------------------------------------------------------------------------------------------------|------|------|--------|-----------|------|------|---------------|----------| | convnextv2huge.fcmaeftin22kin1k512 |88.848|98.742|512 |660.29 |600.81|413.07|28.58 |48 | | convnextv2huge.fcmaeftin22kin1k384 |88.668|98.738|384 |660.29 |337.96|232.35|50.56 |64 | | convnextxxlarge.cliplaion2bsoupftin1k |88.612|98.704|256 |846.47 |198.09|124.45|122.45 |256 | | convnextlargemlp.cliplaion2bsoupftin12kin1k384 |88.312|98.578|384 |200.13 |101.11|126.74|196.84 |256 | | convnextv2large.fcmaeftin22kin1k384 |88.196|98.532|384 |197.96 |101.1 |126.74|128.94 |128 | | convnextlargemlp.cliplaion2bsoupftin12kin1k320 |87.968|98.47 |320 |200.13 |70.21 |88.02 |283.42 |256 | | convnextxlarge.fbin22kftin1k384 |87.75 |98.556|384 |350.2 |179.2 |168.99|124.85 |192 | | convnextv2base.fcmaeftin22kin1k384 |87.646|98.422|384 |88.72 |45.21 |84.49 |209.51 |256 | | convnextlarge.fbin22kftin1k384 |87.476|98.382|384 |197.77 |101.1 |126.74|194.66 |256 | | convnextlargemlp.cliplaion2baugregftin1k |87.344|98.218|256 |200.13 |44.94 |56.33 |438.08 |256 | | convnextv2large.fcmaeftin22kin1k |87.26 |98.248|224 |197.96 |34.4 |43.13 |376.84 |256 | | convnextbase.cliplaion2baugregftin12kin1k384 |87.138|98.212|384 |88.59 |45.21 |84.49 |365.47 |256 | | convnextxlarge.fbin22kftin1k |87.002|98.208|224 |350.2 |60.98 |57.5 |368.01 |256 | | convnextbase.fbin22kftin1k384 |86.796|98.264|384 |88.59 |45.21 |84.49 |366.54 |256 | | convnextv2base.fcmaeftin22kin1k |86.74 |98.022|224 |88.72 |15.38 |28.75 |624.23 |256 | | convnextlarge.fbin22kftin1k |86.636|98.028|224 |197.77 |34.4 |43.13 |581.43 |256 | | convnextbase.cliplaionaaugregftin1k384 |86.504|97.97 |384 |88.59 |45.21 |84.49 |368.14 |256 | | convnextbase.cliplaion2baugregftin12kin1k |86.344|97.97 |256 |88.59 |20.09 |37.55 |816.14 |256 | | convnextv2huge.fcmaeftin1k |86.256|97.75 |224 |660.29 |115.0 |79.07 |154.72 |256 | | convnextsmall.in12kftin1k384 |86.182|97.92 |384 |50.22 |25.58 |63.37 |516.19 |256 | | convnextbase.cliplaion2baugregftin1k |86.154|97.68 |256 |88.59 |20.09 |37.55 |819.86 |256 | | convnextbase.fbin22kftin1k |85.822|97.866|224 |88.59 |15.38 |28.75 |1037.66 |256 | | convnextsmall.fbin22kftin1k384 |85.778|97.886|384 |50.22 |25.58 |63.37 |518.95 |256 | | convnextv2large.fcmaeftin1k |85.742|97.584|224 |197.96 |34.4 |43.13 |375.23 |256 | | convnextsmall.in12kftin1k |85.174|97.506|224 |50.22 |8.71 |21.56 |1474.31 |256 | | convnexttiny.in12kftin1k384 |85.118|97.608|384 |28.59 |13.14 |39.48 |856.76 |256 | | convnextv2tiny.fcmaeftin22kin1k384 |85.112|97.63 |384 |28.64 |13.14 |39.48 |491.32 |256 | | convnextv2base.fcmaeftin1k |84.874|97.09 |224 |88.72 |15.38 |28.75 |625.33 |256 | | 
convnextsmall.fbin22kftin1k |84.562|97.394|224 |50.22 |8.71 |21.56 |1478.29 |256 | | convnextlarge.fbin1k |84.282|96.892|224 |197.77 |34.4 |43.13 |584.28 |256 | | convnexttiny.in12kftin1k |84.186|97.124|224 |28.59 |4.47 |13.44 |2433.7 |256 | | convnexttiny.fbin22kftin1k384 |84.084|97.14 |384 |28.59 |13.14 |39.48 |862.95 |256 | | convnextv2tiny.fcmaeftin22kin1k |83.894|96.964|224 |28.64 |4.47 |13.44 |1452.72 |256 | | convnextbase.fbin1k |83.82 |96.746|224 |88.59 |15.38 |28.75 |1054.0 |256 | | convnextv2nano.fcmaeftin22kin1k384 |83.37 |96.742|384 |15.62 |7.22 |24.61 |801.72 |256 | | convnextsmall.fbin1k |83.142|96.434|224 |50.22 |8.71 |21.56 |1464.0 |256 | | convnextv2tiny.fcmaeftin1k |82.92 |96.284|224 |28.64 |4.47 |13.44 |1425.62 |256 | | convnexttiny.fbin22kftin1k |82.898|96.616|224 |28.59 |4.47 |13.44 |2480.88 |256 | | convnextnano.in12kftin1k |82.282|96.344|224 |15.59 |2.46 |8.37 |3926.52 |256 | | convnexttinyhnf.a2hin1k |82.216|95.852|224 |28.59 |4.47 |13.44 |2529.75 |256 | | convnexttiny.fbin1k |82.066|95.854|224 |28.59 |4.47 |13.44 |2346.26 |256 | | convnextv2nano.fcmaeftin22kin1k |82.03 |96.166|224 |15.62 |2.46 |8.37 |2300.18 |256 | | convnextv2nano.fcmaeftin1k |81.83 |95.738|224 |15.62 |2.46 |8.37 |2321.48 |256 | | convnextnanools.d1hin1k |80.866|95.246|224 |15.65 |2.65 |9.38 |3523.85 |256 | | convnextnano.d1hin1k |80.768|95.334|224 |15.59 |2.46 |8.37 |3915.58 |256 | | convnextv2pico.fcmaeftin1k |80.304|95.072|224 |9.07 |1.37 |6.1 |3274.57 |256 | | convnextpico.d1in1k |79.526|94.558|224 |9.05 |1.37 |6.1 |5686.88 |256 | | convnextpicools.d1in1k |79.522|94.692|224 |9.06 |1.43 |6.5 |5422.46 |256 | | convnextv2femto.fcmaeftin1k |78.488|93.98 |224 |5.23 |0.79 |4.57 |4264.2 |256 | | convnextfemtools.d1in1k |77.86 |93.83 |224 |5.23 |0.82 |4.87 |6910.6 |256 | | convnextfemto.d1in1k |77.454|93.68 |224 |5.22 |0.79 |4.57 |7189.92 |256 | | convnextv2atto.fcmaeftin1k |76.664|93.044|224 |3.71 |0.55 |3.81 |4728.91 |256 | | convnextattools.a2in1k |75.88 |92.846|224 |3.7 |0.58 |4.11 |7963.16 |256 | | convnextatto.d2in1k |75.664|92.9 |224 |3.7 |0.55 |3.81 |8439.22 |256 |
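As a rough usage note (an addition to the card, assuming `timm` >= 0.9 and network access for the weights): the ImageNet-12k checkpoint ships with its 11821-way classifier, which can be dropped to use the model as a feature backbone.

```python
import torch
import timm

# The in12k head has 11821 classes; num_classes=0 removes it for embeddings.
model = timm.create_model('convnext_tiny.in12k', pretrained=True)
print(model.num_classes, model.num_features)  # 11821, pooled feature width

backbone = timm.create_model('convnext_tiny.in12k', pretrained=True, num_classes=0)
with torch.no_grad():
    emb = backbone(torch.randn(1, 3, 224, 224))
print(emb.shape)  # (1, num_features) pooled embedding
```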

license:apache-2.0
40,278
0

maxvit_nano_rw_256.sw_in1k

A timm specific MaxViT image classification model. Trained in `timm` on ImageNet-1k by Ross Wightman. ImageNet-1k training done on TPUs thanks to support of the TRC program. MaxxViT covers a number of related model architectures that share a common structure including: - CoAtNet - Combining MBConv (depthwise-separable) convolutional blocks in early stages with self-attention transformer blocks in later stages. - MaxViT - Uniform blocks across all stages, each containing a MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid). - CoAtNeXt - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT-V2 - A MaxxViT variation that removes the window block attention leaving only ConvNeXt blocks and grid attention w/ more width to compensate. Aside from the major variants listed above, there are more subtle changes from model to model. Any model name with the string `rw` are `timm` specific configs w/ modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models so there are variations. All models with the string `tf` are models exactly matching Tensorflow based models by the original paper authors with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 15.5 - GMACs: 4.5 - Activations (M): 30.3 - Image size: 256 x 256 - Papers: - MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697 - Dataset: ImageNet-1k Model Comparison By Top-1 |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 
257.59| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| By Throughput (samples / sec) |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| 
|maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22|
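Since the card lists this model as a feature backbone, a brief sketch of timm's `features_only` interface may help; it is illustrative only and assumes the pretrained weights can be downloaded.

```python
import torch
import timm

# Multi-scale feature maps from the MaxViT stages (strides roughly 4/8/16/32).
model = timm.create_model(
    'maxvit_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
)
model.eval()

with torch.no_grad():
    feats = model(torch.randn(1, 3, 256, 256))

for fmap, stride in zip(feats, model.feature_info.reduction()):
    print(fmap.shape, 'stride', stride)
```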

license:apache-2.0
40,243
0

resnet50_clip.openai

license:mit
40,213
0

eva02_large_patch14_clip_224.merged2b_s4b_b131k

license:mit
38,139
6

vit_base_patch32_224.augreg_in21k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 104.3 - GMACs: 4.4 - Activations (M): 4.2 - Image size: 224 x 224 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
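A short sketch (not part of the original card; assumes `timm` >= 0.9 and downloadable weights) of using this ImageNet-21k checkpoint as a frozen ViT backbone:

```python
import torch
import timm

model = timm.create_model('vit_base_patch32_224.augreg_in21k', pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    tokens = model.forward_features(x)                     # (1, 50, 768): cls + 7x7 patch tokens
    pooled = model.forward_head(tokens, pre_logits=True)   # (1, 768) pooled embedding, head skipped
print(tokens.shape, pooled.shape)
```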

license:apache-2.0
37,964
1

vit_base_patch16_clip_224.openai_ft_in12k_in1k

license:apache-2.0
37,124
0

efficientnet_b5.sw_in12k_ft_in1k

An EfficientNet image classification model. Pretrained on ImageNet-12k and fine-tuned on ImageNet-1k by Ross Wightman in `timm` using the recipe template described below. Recipe details: Based on Swin Transformer train / pretrain recipe with modifications (related to both DeiT and ConvNeXt recipes) AdamW optimizer, gradient clipping, EMA weight averaging Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 30.4 - GMACs: 9.6 - Activations (M): 93.6 - Image size: 448 x 448 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-12k - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
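Because this checkpoint runs at 448 x 448, it is worth building the eval transform from the model's own pretrained config rather than hard-coding the resolution. A minimal sketch, assuming `timm` >= 0.9:

```python
import timm
from timm.data import resolve_model_data_config, create_transform

model = timm.create_model('efficientnet_b5.sw_in12k_ft_in1k', pretrained=True)

# Pull resolution, interpolation, mean/std, crop pct from the pretrained config.
data_config = resolve_model_data_config(model)
transform = create_transform(**data_config, is_training=False)
print(data_config['input_size'])  # (3, 448, 448)
print(transform)
```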

license:apache-2.0
36,654
2

maxvit_tiny_tf_512.in1k

An official MaxViT image classification model. Trained in tensorflow on ImageNet-1k by paper authors. Ported from official Tensorflow implementation (https://github.com/google-research/maxvit) to PyTorch by Ross Wightman. MaxxViT covers a number of related model architectures that share a common structure including: - CoAtNet - Combining MBConv (depthwise-separable) convolutional blocks in early stages with self-attention transformer blocks in later stages. - MaxViT - Uniform blocks across all stages, each containing a MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid). - CoAtNeXt - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT-V2 - A MaxxViT variation that removes the window block attention leaving only ConvNeXt blocks and grid attention w/ more width to compensate. Aside from the major variants listed above, there are more subtle changes from model to model. Any model name with the string `rw` are `timm` specific configs w/ modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models so there are variations. All models with the string `tf` are models exactly matching Tensorflow based models by the original paper authors with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 31.0 - GMACs: 33.5 - Activations (M): 257.6 - Image size: 512 x 512 - Papers: - MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697 - Dataset: ImageNet-1k Model Comparison By Top-1 |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| 
|maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| By Throughput (samples / sec) |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| 
|maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22|

license:apache-2.0
36,563
0

tf_efficientnetv2_xl.in21k

An EfficientNet-v2 image classification model. Trained on ImageNet-21k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 234.8 - GMACs: 52.8 - Activations (M): 139.2 - Image size: train = 384 x 384, test = 512 x 512 - Papers: - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298 - Dataset: ImageNet-21k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
35,824
1

regnety_120.sw_in12k_ft_in1k

license:apache-2.0
34,602
0

mobilevit_s.cvnets_in1k

A MobileViT image classification model. Trained on ImageNet-1k by paper authors. See license details at https://github.com/apple/ml-cvnets/blob/main/LICENSE Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 5.6 - GMACs: 2.0 - Activations (M): 19.9 - Image size: 256 x 256 - Papers: - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer: https://arxiv.org/abs/2110.02178 - Original: https://github.com/apple/ml-cvnets - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
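A minimal end-to-end classification sketch (an illustrative addition, assuming `timm` >= 0.9, downloadable weights, and a local image; `cat.jpg` is a placeholder path):

```python
import torch
import timm
from PIL import Image
from timm.data import resolve_model_data_config, create_transform

model = timm.create_model('mobilevit_s.cvnets_in1k', pretrained=True)
model.eval()

config = resolve_model_data_config(model)
transform = create_transform(**config, is_training=False)

img = Image.open('cat.jpg').convert('RGB')   # placeholder image path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))
top5 = logits.softmax(dim=-1).topk(5)
print(top5.values, top5.indices)  # top-5 ImageNet-1k probabilities and class indices
```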

34,021
5

efficientnet_b2.ra_in1k

license:apache-2.0
33,055
0

convnextv2_tiny.fcmae_ft_in22k_in1k

license:cc-by-nc-4.0
31,851
2

ghostnet_100.in1k

license:apache-2.0
31,286
0

repvgg_a2.rvgg_in1k

license:mit
30,965
1

vit_base_r50_s16_384.orig_in21k_ft_in1k

A ResNet - Vision Transformer (ViT) hybrid image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 99.0 - GMACs: 61.3 - Activations (M): 81.8 - Image size: 384 x 384 - Papers: - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
29,400
4

pit_b_224.in1k

A PiT (Pooling based Vision Transformer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 73.8 - GMACs: 12.4 - Activations (M): 32.9 - Image size: 224 x 224 - Papers: - Rethinking Spatial Dimensions of Vision Transformers: https://arxiv.org/abs/2103.16302 - Dataset: ImageNet-1k - Original: https://github.com/naver-ai/pit Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
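To compare this checkpoint against its siblings, `timm` can enumerate the pretrained PiT variants directly (a small illustrative sketch; the exact list depends on your `timm` version):

```python
import timm

# All PiT model names with pretrained weights available in this timm install.
print(timm.list_models('pit_*', pretrained=True))
```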

license:apache-2.0
28,904
1

coat_small.in1k

A CoaT (Co-Scale Conv-Attentional Transformer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 21.7 - GMACs: 12.6 - Activations (M): 44.3 - Image size: 224 x 224 - Papers: - Co-Scale Conv-Attentional Image Transformers: https://arxiv.org/abs/2104.06399 - Dataset: ImageNet-1k - Original: https://github.com/mlpc-ucsd/CoaT Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
28,804
0

dm_nfnet_f0.dm_in1k

license:apache-2.0
28,767
1

test_resnet.r160_in1k

license:apache-2.0
28,459
0

vgg16.tv_in1k

license:bsd-3-clause
27,971
7

caformer_s36.sail_in1k

A CAFormer (a MetaFormer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 39.3 - GMACs: 8.0 - Activations (M): 37.5 - Image size: 224 x 224 - Papers: - Metaformer baselines for vision: https://arxiv.org/abs/2210.13452 - Original: https://github.com/sail-sg/metaformer - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
27,694
0

ViT-B-32-SigLIP2-256

license:apache-2.0
27,139
1

levit_256.fb_dist_in1k

license:apache-2.0
26,663
0

visformer_small.in1k

license:apache-2.0
26,292
1

efficientnet_b1.ra4_e3600_r240_in1k

license:apache-2.0
26,257
0

beit_base_patch16_224.in22k_ft_in22k_in1k

A BEiT image classification model. Trained on ImageNet-22k with self-supervised masked image modelling (MIM) using a DALL-E dVAE as visual tokenizer. Fine-tuned on ImageNet-22k and then ImageNet-1k. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 86.5 - GMACs: 17.6 - Activations (M): 23.9 - Image size: 224 x 224 - Papers: - BEiT: BERT Pre-Training of Image Transformers: https://arxiv.org/abs/2106.08254 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-22k - Original: https://github.com/microsoft/unilm/tree/master/beit Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
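A hedged fine-tuning sketch (not from the original card; 10 classes is an arbitrary example and weights are downloaded from the Hub): the classifier head can be swapped at creation time while keeping the pretrained transformer.

```python
import torch
import timm

# Replace the 1000-class head with a freshly initialised 10-class head.
model = timm.create_model(
    'beit_base_patch16_224.in22k_ft_in22k_in1k',
    pretrained=True,
    num_classes=10,
)
print(model.get_classifier())                    # new Linear(768 -> 10) head
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```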

license:apache-2.0
25,946
1

PE-Core-bigG-14-448

license:apache-2.0
25,394
5

nfnet_l0.ra2_in1k

license:apache-2.0
25,388
1

convnextv2_base.fcmae_ft_in22k_in1k

A ConvNeXt-V2 image classification model. Pretrained with a fully convolutional masked autoencoder framework (FCMAE) and fine-tuned on ImageNet-22k and then ImageNet-1k. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 88.7 - GMACs: 15.4 - Activations (M): 28.8 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders: https://arxiv.org/abs/2301.00808 - Original: https://github.com/facebookresearch/ConvNeXt-V2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results. All timing numbers from eager model PyTorch 1.13 on RTX 3090 w/ AMP. | model |top1 |top5 |imgsize|paramcount|gmacs |macts |samplespersec|batchsize| |------------------------------------------------------------------------------------------------------------------------------|------|------|--------|-----------|------|------|---------------|----------| | convnextv2huge.fcmaeftin22kin1k512 |88.848|98.742|512 |660.29 |600.81|413.07|28.58 |48 | | convnextv2huge.fcmaeftin22kin1k384 |88.668|98.738|384 |660.29 |337.96|232.35|50.56 |64 | | convnextxxlarge.cliplaion2bsoupftin1k |88.612|98.704|256 |846.47 |198.09|124.45|122.45 |256 | | convnextlargemlp.cliplaion2bsoupftin12kin1k384 |88.312|98.578|384 |200.13 |101.11|126.74|196.84 |256 | | convnextv2large.fcmaeftin22kin1k384 |88.196|98.532|384 |197.96 |101.1 |126.74|128.94 |128 | | convnextlargemlp.cliplaion2bsoupftin12kin1k320 |87.968|98.47 |320 |200.13 |70.21 |88.02 |283.42 |256 | | convnextxlarge.fbin22kftin1k384 |87.75 |98.556|384 |350.2 |179.2 |168.99|124.85 |192 | | convnextv2base.fcmaeftin22kin1k384 |87.646|98.422|384 |88.72 |45.21 |84.49 |209.51 |256 | | convnextlarge.fbin22kftin1k384 |87.476|98.382|384 |197.77 |101.1 |126.74|194.66 |256 | | convnextlargemlp.cliplaion2baugregftin1k |87.344|98.218|256 |200.13 |44.94 |56.33 |438.08 |256 | | convnextv2large.fcmaeftin22kin1k |87.26 |98.248|224 |197.96 |34.4 |43.13 |376.84 |256 | | convnextbase.cliplaion2baugregftin12kin1k384 |87.138|98.212|384 |88.59 |45.21 |84.49 |365.47 |256 | | convnextxlarge.fbin22kftin1k |87.002|98.208|224 |350.2 |60.98 |57.5 |368.01 |256 | | convnextbase.fbin22kftin1k384 |86.796|98.264|384 |88.59 |45.21 |84.49 |366.54 |256 | | convnextv2base.fcmaeftin22kin1k |86.74 |98.022|224 |88.72 |15.38 |28.75 |624.23 |256 | | convnextlarge.fbin22kftin1k |86.636|98.028|224 |197.77 |34.4 |43.13 |581.43 |256 | | convnextbase.cliplaionaaugregftin1k384 |86.504|97.97 |384 |88.59 |45.21 |84.49 |368.14 |256 | | convnextbase.cliplaion2baugregftin12kin1k |86.344|97.97 |256 |88.59 |20.09 |37.55 |816.14 |256 | | convnextv2huge.fcmaeftin1k |86.256|97.75 |224 |660.29 |115.0 |79.07 |154.72 |256 | | convnextsmall.in12kftin1k384 |86.182|97.92 |384 |50.22 |25.58 |63.37 |516.19 |256 | | convnextbase.cliplaion2baugregftin1k |86.154|97.68 |256 |88.59 |20.09 |37.55 |819.86 |256 | | convnextbase.fbin22kftin1k |85.822|97.866|224 |88.59 |15.38 |28.75 |1037.66 |256 | | convnextsmall.fbin22kftin1k384 |85.778|97.886|384 |50.22 |25.58 |63.37 |518.95 |256 | | convnextv2large.fcmaeftin1k |85.742|97.584|224 |197.96 |34.4 |43.13 |375.23 |256 | | convnextsmall.in12kftin1k |85.174|97.506|224 |50.22 |8.71 |21.56 |1474.31 |256 | | convnexttiny.in12kftin1k384 |85.118|97.608|384 |28.59 |13.14 |39.48 |856.76 |256 | | convnextv2tiny.fcmaeftin22kin1k384 |85.112|97.63 |384 |28.64 |13.14 |39.48 |491.32 |256 | | convnextv2base.fcmaeftin1k |84.874|97.09 |224 
|88.72 |15.38 |28.75 |625.33 |256 | | convnextsmall.fbin22kftin1k |84.562|97.394|224 |50.22 |8.71 |21.56 |1478.29 |256 | | convnextlarge.fbin1k |84.282|96.892|224 |197.77 |34.4 |43.13 |584.28 |256 | | convnexttiny.in12kftin1k |84.186|97.124|224 |28.59 |4.47 |13.44 |2433.7 |256 | | convnexttiny.fbin22kftin1k384 |84.084|97.14 |384 |28.59 |13.14 |39.48 |862.95 |256 | | convnextv2tiny.fcmaeftin22kin1k |83.894|96.964|224 |28.64 |4.47 |13.44 |1452.72 |256 | | convnextbase.fbin1k |83.82 |96.746|224 |88.59 |15.38 |28.75 |1054.0 |256 | | convnextv2nano.fcmaeftin22kin1k384 |83.37 |96.742|384 |15.62 |7.22 |24.61 |801.72 |256 | | convnextsmall.fbin1k |83.142|96.434|224 |50.22 |8.71 |21.56 |1464.0 |256 | | convnextv2tiny.fcmaeftin1k |82.92 |96.284|224 |28.64 |4.47 |13.44 |1425.62 |256 | | convnexttiny.fbin22kftin1k |82.898|96.616|224 |28.59 |4.47 |13.44 |2480.88 |256 | | convnextnano.in12kftin1k |82.282|96.344|224 |15.59 |2.46 |8.37 |3926.52 |256 | | convnexttinyhnf.a2hin1k |82.216|95.852|224 |28.59 |4.47 |13.44 |2529.75 |256 | | convnexttiny.fbin1k |82.066|95.854|224 |28.59 |4.47 |13.44 |2346.26 |256 | | convnextv2nano.fcmaeftin22kin1k |82.03 |96.166|224 |15.62 |2.46 |8.37 |2300.18 |256 | | convnextv2nano.fcmaeftin1k |81.83 |95.738|224 |15.62 |2.46 |8.37 |2321.48 |256 | | convnextnanools.d1hin1k |80.866|95.246|224 |15.65 |2.65 |9.38 |3523.85 |256 | | convnextnano.d1hin1k |80.768|95.334|224 |15.59 |2.46 |8.37 |3915.58 |256 | | convnextv2pico.fcmaeftin1k |80.304|95.072|224 |9.07 |1.37 |6.1 |3274.57 |256 | | convnextpico.d1in1k |79.526|94.558|224 |9.05 |1.37 |6.1 |5686.88 |256 | | convnextpicools.d1in1k |79.522|94.692|224 |9.06 |1.43 |6.5 |5422.46 |256 | | convnextv2femto.fcmaeftin1k |78.488|93.98 |224 |5.23 |0.79 |4.57 |4264.2 |256 | | convnextfemtools.d1in1k |77.86 |93.83 |224 |5.23 |0.82 |4.87 |6910.6 |256 | | convnextfemto.d1in1k |77.454|93.68 |224 |5.22 |0.79 |4.57 |7189.92 |256 | | convnextv2atto.fcmaeftin1k |76.664|93.044|224 |3.71 |0.55 |3.81 |4728.91 |256 | | convnextattools.a2in1k |75.88 |92.846|224 |3.7 |0.58 |4.11 |7963.16 |256 | | convnextatto.d2in1k |75.664|92.9 |224 |3.7 |0.55 |3.81 |8439.22 |256 |

license:cc-by-nc-4.0
24,892
2

tf_efficientnetv2_b0.in1k

license:apache-2.0
24,488
2

efficientnetv2_rw_m.agc_in1k

license:apache-2.0
24,196
0

maxvit_large_tf_224.in21k

license:apache-2.0
24,010
0

vit_base_patch16_384.augreg_in21k_ft_in1k

license:apache-2.0
23,824
0

deit_base_patch16_224.fb_in1k

A DeiT image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 86.6 - GMACs: 17.6 - Activations (M): 23.9 - Image size: 224 x 224 - Papers: - Training data-efficient image transformers & distillation through attention: https://arxiv.org/abs/2012.12877 - Original: https://github.com/facebookresearch/deit - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
23,600
0

vit_large_patch14_clip_224.laion400m_e32

license:mit
23,048
0

lcnet_050.ra2_in1k

license:apache-2.0
23,021
0

convnext_pico.d1_in1k

license:apache-2.0
22,580
0

vit_small_patch16_384.augreg_in21k_ft_in1k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 22.2 - GMACs: 12.4 - Activations (M): 24.2 - Image size: 384 x 384 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
22,420
2

ViT-B-16-SigLIP

license:apache-2.0
21,987
33

cait_m48_448.fb_dist_in1k

A CaiT (Class-Attention in Image Transformers) image classification model. Pretrained on ImageNet-1k with distillation by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 356.5 - GMACs: 329.4 - Activations (M): 1708.2 - Image size: 448 x 448 - Papers: - Going deeper with Image Transformers: https://arxiv.org/abs/2103.17239 - Dataset: ImageNet-1k - Original: https://github.com/facebookresearch/deit

license:apache-2.0
21,859
0

vit_small_patch16_224.augreg_in21k

license:apache-2.0
20,716
0

eva02_enormous_patch14_plus_clip_224.laion2b_s9b_b144k

license:mit
20,615
8

wide_resnet101_2.tv_in1k

license:bsd-3-clause
20,265
0

tf_efficientnet_b3.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 12.2 - GMACs: 1.9 - Activations (M): 23.8 - Image size: 300 x 300 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
19,933
1

vit_base_patch16_siglip_224.v2_webli

license:apache-2.0
19,424
0

convnext_atto.d2_in1k

license:apache-2.0
19,367
0

resnetv2_50x1_bit.goog_in21k_ft_in1k

A ResNet-V2-BiT (Big Transfer w/ pre-activation ResNet) image classification model. Pretrained on ImageNet-21k and fine-tuned on ImageNet-1k by paper authors. This model uses Group Normalization (GN) in combination with Weight Standardization (WS) instead of Batch Normalization (BN). Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 25.5 - GMACs: 16.6 - Activations (M): 44.5 - Image size: 448 x 448 - Papers: - Big Transfer (BiT): General Visual Representation Learning: https://arxiv.org/abs/1912.11370 - Identity Mappings in Deep Residual Networks: https://arxiv.org/abs/1603.05027 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/bigtransfer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
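The GN + WS design is easy to verify from the instantiated module tree; the sketch below is illustrative (it builds the architecture without downloading weights) and simply counts normalisation layers.

```python
import timm
import torch.nn as nn

# pretrained=False: architecture only, no weight download needed for this check.
model = timm.create_model('resnetv2_50x1_bit.goog_in21k_ft_in1k', pretrained=False)

n_gn = sum(isinstance(m, nn.GroupNorm) for m in model.modules())
n_bn = sum(isinstance(m, nn.BatchNorm2d) for m in model.modules())
print(f'GroupNorm layers: {n_gn}, BatchNorm2d layers: {n_bn}')  # expect zero BatchNorm
```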

license:apache-2.0
18,566
0

vit_base_patch16_siglip_256.webli

license:apache-2.0
18,343
1

vit_base_patch16_224.mae

license:cc-by-nc-4.0
18,205
5

vit_tiny_patch16_384.augreg_in21k_ft_in1k

license:apache-2.0
18,000
0

convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384

license:apache-2.0
17,661
4

vit_large_patch16_siglip_256.v2_webli

license:apache-2.0
17,444
0

ViT-L-16-SigLIP2-512

license:apache-2.0
17,431
3

ViT-B-16-SigLIP2

license:apache-2.0
17,262
0

tf_efficientnet_b5.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 30.4 - GMACs: 10.5 - Activations (M): 98.9 - Image size: 456 x 456 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
17,159
1

fastvit_t8.apple_dist_in1k

A FastViT image classification model. Trained on ImageNet-1k with distillation by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 4.0 - GMACs: 0.7 - Activations (M): 8.6 - Image size: 256 x 256 - Papers: - FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization: https://arxiv.org/abs/2303.14189 - Original: https://github.com/apple/ml-fastvit - Dataset: ImageNet-1k

16,731
0

mobilenetv4_conv_small.e2400_r224_in1k

license:apache-2.0
16,657
10

ViT-SO400M-14-SigLIP2-378

license:apache-2.0
16,646
1

vit_base_patch16_siglip_512.v2_webli

A SigLIP 2 ViT (image encoder only) for `timm`. Equivalent to the image tower from https://huggingface.co/timm/ViT-B-16-SigLIP2-512. Model Details - Dataset: webli - Papers: - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786 - Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
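Since only the image tower is provided, the typical use is producing image embeddings. The sketch below is an illustrative addition (assumes a recent `timm` with SigLIP 2 support and downloadable weights); pairing with the matching text tower is outside `timm`.

```python
import torch
import torch.nn.functional as F
import timm

# Image encoder only: num_classes=0 returns the pooled image embedding.
model = timm.create_model('vit_base_patch16_siglip_512.v2_webli', pretrained=True, num_classes=0)
model.eval()

with torch.no_grad():
    emb = model(torch.randn(1, 3, 512, 512))
emb = F.normalize(emb, dim=-1)  # unit-normalised, as typically used for retrieval scoring
print(emb.shape)
```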

license:apache-2.0
16,546
2

convnext_small.in12k_ft_in1k

license:apache-2.0
16,325
0

convnextv2_tiny.fcmae_ft_in22k_in1k_384

license:cc-by-nc-4.0
16,249
2

resnext50_32x4d.a1h_in1k

license:apache-2.0
15,955
0

vit_base_patch32_224.augreg_in21k_ft_in1k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 88.2 - GMACs: 4.4 - Activations (M): 4.2 - Image size: 224 x 224 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
15,869
2

hrnet_w18.ms_aug_in1k

license:mit
15,745
3

resnet50.am_in1k

license:apache-2.0
15,422
0

vit_base_patch14_reg4_dinov2.lvd142m

license:apache-2.0
15,263
13

efficientformerv2_s0.snap_dist_in1k

license:apache-2.0
14,518
1

tf_efficientnet_b4.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 19.3 - GMACs: 4.5 - Activations (M): 49.5 - Image size: 380 x 380 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
14,481
0

resnet101_clip.yfcc15m

license:mit
14,131
0

swin_small_patch4_window7_224.ms_in22k_ft_in1k

license:mit
14,017
0

repvit_m1.dist_in1k

license:apache-2.0
13,882
1

resnet152.a1_in1k

license:apache-2.0
13,742
0

maxvit_tiny_rw_224.sw_in1k

license:apache-2.0
13,545
0

resnet50.tv_in1k

license:bsd-3-clause
13,404
1

resnet18.tv_in1k

license:bsd-3-clause
12,745
0

convnext_base.dinov3_lvd1689m

A DINOv3 ConvNeXt image feature model. Pretrained on LVD-1689M with the self-supervised DINOv3 method, distilled from DINOv3 ViT-7B. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 87.6 - GMACs: 15.4 - Activations (M): 28.8 - Image size: 224 x 224 - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models - Original: https://github.com/facebookresearch/dinov3 - Pretrain Dataset: LVD-1689M - License: DINOv3
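As a feature backbone, this checkpoint is typically used without a classifier; a minimal sketch (illustrative only, assumes downloadable weights):

```python
import torch
import timm

# num_classes=0 removes the head so the forward pass returns pooled features.
model = timm.create_model('convnext_base.dinov3_lvd1689m', pretrained=True, num_classes=0)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    fmap = model.forward_features(x)  # unpooled feature map, (1, C, 7, 7) at stride 32
    pooled = model(x)                 # (1, C) global-pooled embedding
print(fmap.shape, pooled.shape)
```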

12,680
0

tiny_vit_21m_512.dist_in22k_ft_in1k

A TinyViT image classification model. Pretrained on ImageNet-22k with distillation and fine-tuned on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 21.3 - GMACs: 21.2 - Activations (M): 83.3 - Image size: 512 x 512 - Papers: - TinyViT: Fast Pretraining Distillation for Small Vision Transformers: https://arxiv.org/abs/2207.10666 - Original: https://github.com/microsoft/Cream/tree/main/TinyViT - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-22k

license:apache-2.0
12,590
2

tf_efficientnet_lite0.in1k

license:apache-2.0
12,294
0

beitv2_base_patch16_224.in1k_ft_in22k_in1k

license:apache-2.0
12,171
0

tf_efficientnet_b2.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 9.1 - GMACs: 1.0 - Activations (M): 13.8 - Image size: 260 x 260 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
12,042
0

densenet201.tv_in1k

license:apache-2.0
11,996
0

vit_so400m_patch14_siglip_224.v2_webli

license:apache-2.0
11,850
0

vit_large_patch14_clip_224.laion400m_e31

license:mit
11,602
0

vit_small_r26_s32_224.augreg_in21k

license:apache-2.0
11,537
0

resnext101_32x8d.fb_wsl_ig1b_ft_in1k

license:cc-by-nc-4.0
11,461
0

vit_small_patch8_224.dino

license:apache-2.0
11,209
2

resnest14d.gluon_in1k

license:apache-2.0
11,132
0

vit_base_patch16_siglip_256.v2_webli

license:apache-2.0
10,974
3

ViT-L-16-SigLIP2-256

license:apache-2.0
10,962
0

convnext_tiny.fb_in22k_ft_in1k

license:apache-2.0
10,708
0

vit_large_patch16_siglip_384.v2_webli

license:apache-2.0
10,706
0

regnetx_002.pycls_in1k

license:mit
10,362
0

vit_7b_patch16_dinov3.lvd1689m

A DINOv3 ViT image feature encoder. Pretrained on LVD-1689M with the self-supervised DINOv3 method. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`), but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support, and appears to work as well as, if not slightly better than, the original for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and reproduce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 6716.0 - GMACs: 1775.1 - Activations (M): 515.9 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols. Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
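
The RoPE note above translates directly into one line of code. A hedged sketch follows, shown on the smaller ViT-S/16 `qkvb` variant from this list because the 7B checkpoint is impractical to download for a demo; the same line applies to any of the DINOv3 ViTs here.

```python
# Minimal sketch: truncate timm's float32 RoPE periods through bfloat16 to match the
# original checkpoint's numerics, as described in the model notes above.
import torch
import timm

model = timm.create_model('vit_small_patch16_dinov3_qkvb.lvd1689m', pretrained=True)
model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)
```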

9,992
0

eva02_large_patch14_448.mim_m38m_ft_in22k

license:mit
9,784
0

eca_nfnet_l0.ra2_in1k

license:apache-2.0
9,711
0

vit_base_patch8_224.dino

license:apache-2.0
9,640
2

eva02_base_patch16_clip_224.merged2b_s8b_b131k

license:mit
9,618
0

convnextv2_large.fcmae_ft_in22k_in1k

license:cc-by-nc-4.0
9,513
0

eva_large_patch14_196.in22k_ft_in22k_in1k

license:mit
9,467
3

fastvit_t8.apple_in1k

9,406
2

efficientnetv2_rw_s.ra2_in1k

license:apache-2.0
9,275
1

swinv2_tiny_window8_256.ms_in1k

license:mit
9,178
2

vit_large_patch14_clip_224.laion2b

license:apache-2.0
9,098
0

convnext_tiny.fb_in22k

license:apache-2.0
8,772
1

vit_base_patch16_224_miil.in21k

license:apache-2.0
8,763
1

vit_small_patch16_dinov3_qkvb.lvd1689m

A DINOv3 ViT image feature encoder. Distilled on LVD-1689M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`), but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support, and appears to work as well as, if not slightly better than, the original for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and reproduce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 21.6 - GMACs: 6.3 - Activations (M): 17.0 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols. Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
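
As an assumed usage sketch (not from the card), per-patch token features can be pulled from this encoder with `forward_features`; with a 256 x 256 input and 16 x 16 patches this yields a 16 x 16 grid of patch tokens plus the class/register prefix tokens.

```python
# Minimal sketch, assuming a recent timm release: token-level features from the ViT-S/16 encoder.
import torch
import timm

model = timm.create_model('vit_small_patch16_dinov3_qkvb.lvd1689m', pretrained=True)
model.eval()

x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    tokens = model.forward_features(x)   # [1, num_prefix_tokens + 16*16, embed_dim]
print(tokens.shape)
```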

8,761
0

resnet50_clip.yfcc15m

license:mit
8,720
0

tf_efficientnetv2_l.in21k_ft_in1k

license:apache-2.0
8,606
2

fastvit_t12.apple_in1k

8,395
0

resnet152.a3_in1k

license:apache-2.0
8,269
0

resnet50_gn.a1h_in1k

license:apache-2.0
8,250
0

seresnet50.a1_in1k

license:apache-2.0
7,992
0

mnasnet_100.rmsp_in1k

license:apache-2.0
7,945
0

vit_base_patch32_clip_224.laion2b_e16

license:mit
7,916
0

convnext_base.fb_in22k_ft_in1k_384

license:apache-2.0
7,891
0

convformer_s18.sail_in22k

license:apache-2.0
7,840
0

dm_nfnet_f1.dm_in1k

license:apache-2.0
7,791
0

convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320

license:apache-2.0
7,645
4

tf_efficientnet_b1.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 7.8 - GMACs: 0.7 - Activations (M): 10.9 - Image size: 240 x 240 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
7,312
0

tiny_vit_21m_224.dist_in22k_ft_in1k

license:apache-2.0
7,188
0

mobilevitv2_050.cvnets_in1k

7,110
2

eca_nfnet_l2.ra3_in1k

license:apache-2.0
7,096
0

convnext_small.fb_in22k_ft_in1k

license:apache-2.0
6,975
1

mobilenetv4_conv_medium.e500_r256_in1k

license:apache-2.0
6,966
1

vit_huge_plus_patch16_dinov3.lvd1689m

A DINOv3 ViT image feature encoder. Distilled on LVD-1689M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`), but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support, and appears to work as well as, if not slightly better than, the original for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and reproduce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 840.5 - GMACs: 224.9 - Activations (M): 193.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols. Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |

6,887
3

maxvit_large_tf_512.in1k

license:apache-2.0
6,883
1

tf_efficientnet_b7.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 66.3 - GMACs: 38.3 - Activations (M): 289.9 - Image size: 600 x 600 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
6,837
0

tf_efficientnet_lite1.in1k

license:apache-2.0
6,699
0

resmlp_12_224.fb_in1k

license:apache-2.0
6,685
0

coatnet_1_rw_224.sw_in1k

license:apache-2.0
6,631
0

mobilenetv4_conv_small_050.e3000_r224_in1k

license:apache-2.0
6,609
5

resnetv2_50x1_bit.goog_in21k

license:apache-2.0
6,555
5

hrnet_w32.ms_in1k

license:mit
6,514
0

mobileone_s1.apple_in1k

6,442
0

vit_base_patch16_224.augreg_in21k_ft_in1k

license:apache-2.0
6,417
0

efficientvit_m5.r224_in1k

license:mit
6,362
0

vit_base_patch16_224.orig_in21k

license:apache-2.0
6,287
1

convnextv2_atto.fcmae_ft_in1k

license:cc-by-nc-4.0
6,272
0

resnet34.tv_in1k

license:bsd-3-clause
6,215
0

tinynet_a.in1k

license:apache-2.0
6,210
0

ViT-B-16-SigLIP-512

license:apache-2.0
6,185
11

vit_base_patch16_clip_224.metaclip_2pt5b

license:cc-by-nc-4.0
6,097
1

wide_resnet101_2.tv2_in1k

license:bsd-3-clause
6,054
0

vit_base_patch16_clip_224.openai_ft_in1k

license:apache-2.0
6,026
1

pvt_v2_b2.in1k

license:apache-2.0
5,958
1

vit_small_patch16_224.augreg_in1k

license:apache-2.0
5,926
3

efficientnet_lite0.ra_in1k

license:apache-2.0
5,844
0

convnextv2_base.fcmae_ft_in22k_in1k_384

license:cc-by-nc-4.0
5,804
0

vit_small_r26_s32_224.augreg_in21k_ft_in1k

license:apache-2.0
5,789
0

deit_tiny_distilled_patch16_224.fb_in1k

license:apache-2.0
5,780
0

coatnet_0_rw_224.sw_in1k

license:apache-2.0
5,644
2

vgg11.tv_in1k

license:bsd-3-clause
5,632
0

ese_vovnet19b_dw.ra_in1k

license:apache-2.0
5,560
0

dm_nfnet_f3.dm_in1k

license:apache-2.0
5,498
0

ViT-gopt-16-SigLIP2-384

license:apache-2.0
5,474
4

regnety_002.pycls_in1k

license:mit
5,460
1

ViT-B-16-SigLIP-384

license:apache-2.0
5,458
4

vit_so400m_patch16_siglip_512.v2_webli

license:apache-2.0
5,454
0

resnet152.a1h_in1k

license:apache-2.0
5,425
0

resnet10t.c3_in1k

license:apache-2.0
5,409
0

vit_huge_patch14_clip_224.metaclip_2pt5b

license:cc-by-nc-4.0
5,391
0

wide_resnet50_2.tv2_in1k

license:bsd-3-clause
5,343
0

vit_large_patch14_clip_224.openai_ft_in1k

license:apache-2.0
5,315
1

efficientvit_b2.r224_in1k

license:apache-2.0
5,312
0

mixer_b16_224.goog_in21k_ft_in1k

license:apache-2.0
5,264
2

maxvit_base_tf_384.in1k

license:apache-2.0
5,204
1

convnext_base.fb_in22k

license:apache-2.0
5,105
0

eva02_base_patch16_clip_224.merged2b

license:mit
5,057
0

swinv2_tiny_window16_256.ms_in1k

license:mit
5,031
0

vit_base_patch16_224.orig_in21k_ft_in1k

license:apache-2.0
5,030
3

densenet121.tv_in1k

license:apache-2.0
4,969
0

resnest50d.in1k

license:apache-2.0
4,935
0

PE-Core-B-16

license:apache-2.0
4,846
0

vit_small_patch32_224.augreg_in21k_ft_in1k

license:apache-2.0
4,841
2

swinv2_base_window12to16_192to256.ms_in22k_ft_in1k

license:mit
4,710
0

vit_large_patch14_clip_224.openai

license:apache-2.0
4,701
2

tf_efficientnetv2_b3.in21k_ft_in1k

license:apache-2.0
4,571
2

swinv2_base_window8_256.ms_in1k

license:mit
4,569
0

vit_base_patch32_clip_224.metaclip_400m

license:cc-by-nc-4.0
4,551
0

wide_resnet50_2.tv_in1k

license:bsd-3-clause
4,511
0

swin_large_patch4_window12_384.ms_in22k_ft_in1k

license:mit
4,497
0

efficientnetv2_rw_t.ra2_in1k

license:apache-2.0
4,412
0

deit_small_distilled_patch16_224.fb_in1k

license:apache-2.0
4,407
0

fbnetc_100.rmsp_in1k

license:apache-2.0
4,390
0

ViT-SO400M-16-SigLIP2-512

license:apache-2.0
4,322
5

resnet101.a1_in1k

license:apache-2.0
4,300
1

resnet50.tv2_in1k

license:bsd-3-clause
4,288
0

levit_128.fb_dist_in1k

license:apache-2.0
4,245
1

ViT-L-16-SigLIP-256

license:apache-2.0
4,243
1

cspdarknet53.ra_in1k

license:apache-2.0
4,210
0

efficientformerv2_l.snap_dist_in1k

license:apache-2.0
4,186
2

resnet50_clip.cc12m

license:mit
4,076
0

tinynet_e.in1k

license:apache-2.0
4,068
0

cait_s24_224.fb_dist_in1k

license:apache-2.0
4,005
1

seresnext50_32x4d.racm_in1k

license:apache-2.0
3,997
0

regnetx_006.pycls_in1k

license:mit
3,987
0

vit_gigantic_patch14_clip_224.metaclip_2pt5b

license:cc-by-nc-4.0
3,986
0

regnety_160.deit_in1k

license:apache-2.0
3,963
0

swin_large_patch4_window7_224.ms_in22k

license:mit
3,913
1

regnety_008.pycls_in1k

license:mit
3,906
0

mobilenetv4_conv_small.e1200_r224_in1k

license:apache-2.0
3,867
3

mobilevitv2_200.cvnets_in1k

3,860
0

tresnet_m.miil_in21k

license:apache-2.0
3,833
1

rexnet_100.nav_in1k

license:mit
3,775
0

crossvit_9_240.in1k

license:apache-2.0
3,699
2

maxvit_base_tf_512.in21k_ft_in1k

license:apache-2.0
3,697
1

mixnet_l.ft_in1k

license:apache-2.0
3,686
0

swinv2_base_window12to24_192to384.ms_in22k_ft_in1k

license:mit
3,682
0

eva02_base_patch14_224.mim_in22k

license:mit
3,681
7

deit_base_patch16_384.fb_in1k

license:apache-2.0
3,678
0

swin_large_patch4_window7_224.ms_in22k_ft_in1k

license:mit
3,636
0

caformer_b36.sail_in22k_ft_in1k

license:apache-2.0
3,631
1

convnext_small.fb_in22k_ft_in1k_384

license:apache-2.0
3,620
1

maxvit_tiny_tf_224.in1k

license:apache-2.0
3,584
0

ViT-L-16-SigLIP-384

license:apache-2.0
3,580
28

swinv2_cr_tiny_ns_224.sw_in1k

license:apache-2.0
3,580
0

vit_large_patch16_dinov3.sat493m

A DINOv3 ViT image feature encoder. Distilled on SAT-493M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`), but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support, and appears to work as well as, if not slightly better than, the original for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and reproduce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 303.1 - GMACs: 82.4 - Activations (M): 90.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: SAT-493M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols. Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |

3,557
0

tiny_vit_21m_384.dist_in22k_ft_in1k

license:apache-2.0
3,555
3

davit_tiny.msft_in1k

license:apache-2.0
3,555
1

vit_large_patch32_384.orig_in21k_ft_in1k

license:apache-2.0
3,552
0

resnet18d.ra2_in1k

license:apache-2.0
3,551
0

resnetv2_50.a1h_in1k

license:apache-2.0
3,536
0

swin_base_patch4_window7_224.ms_in22k

license:mit
3,492
0

vit_base_patch32_clip_224.metaclip_2pt5b

license:cc-by-nc-4.0
3,468
0

PE-Core-L-14-336

license:apache-2.0
3,436
3

dla34.in1k

license:bsd-3-clause
3,433
0

eva02_base_patch14_448.mim_in22k_ft_in22k_in1k

Model card for eva02_base_patch14_448.mim_in22k_ft_in22k_in1k An EVA02 image classification model. Pretrained on ImageNet-22k with masked image modeling (using EVA-CLIP as a MIM teacher) and fine-tuned on ImageNet-22k then on ImageNet-1k by paper authors. EVA-02 models are vision transformers with mean pooling, SwiGLU, Rotary Position Embeddings (ROPE), and extra LN in MLP (for Base & Large). NOTE: `timm` checkpoints are float32 for consistency with other models. Original checkpoints are float16 or bfloat16 in some cases; see the originals if that's preferred. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 87.1 - GMACs: 107.1 - Activations (M): 259.1 - Image size: 448 x 448 - Papers: - EVA-02: A Visual Representation for Neon Genesis: https://arxiv.org/abs/2303.11331 - EVA-CLIP: Improved Training Techniques for CLIP at Scale: https://arxiv.org/abs/2303.15389 - Original: - https://github.com/baaivision/EVA - https://huggingface.co/Yuxin-CV/EVA-02 - Pretrain Dataset: ImageNet-22k - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |top1 |top5 |param_count|img_size| |-----|-----|-----|-----------|--------| |eva02_large_patch14_448.mim_m38m_ft_in22k_in1k|90.054|99.042|305.08|448| |eva02_large_patch14_448.mim_in22k_ft_in22k_in1k|89.946|99.01|305.08|448| |eva_giant_patch14_560.m30m_ft_in22k_in1k|89.792|98.992|1014.45|560| |eva02_large_patch14_448.mim_in22k_ft_in1k|89.626|98.954|305.08|448| |eva02_large_patch14_448.mim_m38m_ft_in1k|89.57|98.918|305.08|448| |eva_giant_patch14_336.m30m_ft_in22k_in1k|89.56|98.956|1013.01|336| |eva_giant_patch14_336.clip_ft_in1k|89.466|98.82|1013.01|336| |eva_large_patch14_336.in22k_ft_in22k_in1k|89.214|98.854|304.53|336| |eva_giant_patch14_224.clip_ft_in1k|88.882|98.678|1012.56|224| |eva02_base_patch14_448.mim_in22k_ft_in22k_in1k|88.692|98.722|87.12|448| |eva_large_patch14_336.in22k_ft_in1k|88.652|98.722|304.53|336| |eva_large_patch14_196.in22k_ft_in22k_in1k|88.592|98.656|304.14|196| |eva02_base_patch14_448.mim_in22k_ft_in1k|88.23|98.564|87.12|448| |eva_large_patch14_196.in22k_ft_in1k|87.934|98.504|304.14|196| |eva02_small_patch14_336.mim_in22k_ft_in1k|85.74|97.614|22.13|336| |eva02_tiny_patch14_336.mim_in22k_ft_in1k|80.658|95.524|5.76|336|
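
Following the float32 note above, a hedged sketch of loading the `timm` checkpoint and casting it to fp16 for inference (GPU recommended; the accuracy impact of fp16 inference is usually small but not guaranteed):

```python
# Minimal sketch, assuming a recent timm release: fp16 inference with the fp32 checkpoint.
import torch
import timm

model = timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in22k_in1k', pretrained=True)
model.eval()

if torch.cuda.is_available():
    model = model.half().cuda()
    x = torch.randn(1, 3, 448, 448, dtype=torch.float16, device='cuda')
else:
    x = torch.randn(1, 3, 448, 448)      # fall back to fp32 on CPU

with torch.no_grad():
    logits = model(x)                    # shape [1, 1000]
print(logits.dtype, logits.shape)
```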

license:mit
3,420
6

vit_relpos_medium_patch16_224.sw_in1k

license:apache-2.0
3,272
0

vit_base_patch32_clip_224.laion400m_e31

license:mit
3,266
0

vgg16_bn.tv_in1k

license:bsd-3-clause
3,229
1

regnetz_040_h.ra3_in1k

license:apache-2.0
3,115
1

convnext_large.dinov3_lvd1689m

A DINOv3 ConvNeXt image feature model. Pretrained on LVD-1689M with self-supervised DINOv3 method, distilled from DINOv3 ViT-7B. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 196.2 - GMACs: 34.4 - Activations (M): 43.1 - Image size: 224 x 224 - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models - Original: https://github.com/facebookresearch/dinov3 - Pretrain Dataset: LVD-1689M - License: DINOv3

3,113
0

res2next50.in1k

3,098
0

maxvit_small_tf_224.in1k

license:apache-2.0
3,090
0

vit_base_patch16_siglip_gap_224.v2_webli

license:apache-2.0
3,082
0

mobilevit_xxs.cvnets_in1k

3,070
1

tf_efficientnetv2_s.in1k

license:apache-2.0
3,063
0

ghostnetv2_100.in1k

license:apache-2.0
3,049
1

convit_base.fb_in1k

license:apache-2.0
3,016
0

mobilenetv2_050.lamb_in1k

license:apache-2.0
2,973
2

maxvit_base_tf_224.in1k

license:apache-2.0
2,956
1

convnextv2_large.fcmae_ft_in22k_in1k_384

license:cc-by-nc-4.0
2,940
0

mobilevitv2_075.cvnets_in1k

2,932
0

gmlp_s16_224.ra3_in1k

license:apache-2.0
2,928
1

resnet50d.a1_in1k

license:apache-2.0
2,907
0

eva02_large_patch14_clip_336.merged2b_s6b_b61k

license:mit
2,906
0

hrnet_w48.ms_in1k

license:mit
2,882
1

eva02_enormous_patch14_clip_224.laion2b_s4b_b115k

license:mit
2,878
1

swinv2_small_window8_256.ms_in1k

license:mit
2,806
0

vit_giant_patch14_dinov2.lvd142m

license:apache-2.0
2,806
0

tf_efficientnetv2_m.in21k

license:apache-2.0
2,795
2

convnext_tiny.fb_in1k

license:apache-2.0
2,778
1

dm_nfnet_f2.dm_in1k

license:apache-2.0
2,775
0

swinv2_small_window16_256.ms_in1k

license:mit
2,769
0

resnet18.a3_in1k

license:apache-2.0
2,750
0

mobilevitv2_100.cvnets_in1k

2,715
1

efficientnet_b1.ft_in1k

license:apache-2.0
2,690
0

vit_base_patch16_clip_224.laion2b_ft_in12k_in1k

license:apache-2.0
2,687
2

mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k

license:apache-2.0
2,686
2

resnetv2_101x1_bit.goog_in21k_ft_in1k

license:apache-2.0
2,679
0

darknet53.c2ns_in1k

license:apache-2.0
2,678
0

ViT-L-16-SigLIP2-384

license:apache-2.0
2,658
0

repvit_m1_5.dist_450e_in1k

license:apache-2.0
2,642
0

deit3_small_patch16_224.fb_in22k_ft_in1k

license:apache-2.0
2,635
0

convnext_small.fb_in22k

license:apache-2.0
2,634
1

resnest101e.in1k

license:apache-2.0
2,595
1

mobilenetv4_conv_large.e600_r384_in1k

license:apache-2.0
2,575
3

regnety_040.ra3_in1k

license:apache-2.0
2,572
0

xception41.tf_in1k

license:apache-2.0
2,561
1

resnext101_32x8d.tv_in1k

license:bsd-3-clause
2,553
1

eva_giant_patch14_plus_clip_224.merged2b_s11b_b114k

license:mit
2,518
1

vit_large_patch32_224.orig_in21k

license:apache-2.0
2,502
0

mobilenetv3_large_100.miil_in21k

license:apache-2.0
2,413
0

tresnet_l.miil_in1k

license:apache-2.0
2,353
0

hrnet_w18_small_v2.gluon_in1k

license:apache-2.0
2,310
0

eva_large_patch14_336.in22k_ft_in22k_in1k

license:mit
2,298
1

vit_huge_patch14_224.orig_in21k

license:apache-2.0
2,293
2

convnextv2_pico.fcmae_ft_in1k

license:cc-by-nc-4.0
2,268
1

resnet101_clip.openai

license:mit
2,268
0

vit_base_patch32_clip_224.openai

license:apache-2.0
2,225
0

resnet34d.ra2_in1k

license:apache-2.0
2,176
0

davit_small.msft_in1k

license:apache-2.0
2,175
1

deit3_small_patch16_224.fb_in1k

license:apache-2.0
2,171
0

vit_huge_patch14_clip_224.laion2b

license:apache-2.0
2,165
0

beit_base_patch16_224.in22k_ft_in22k

license:apache-2.0
2,126
0

efficientvit_l2.r224_in1k

license:apache-2.0
2,109
1

regnety_016.tv2_in1k

license:bsd-3-clause
2,084
0

vit_base_patch16_clip_224.metaclip_400m

license:cc-by-nc-4.0
2,079
1

fastvit_s12.apple_in1k

2,075
0

convnextv2_huge.fcmae_ft_in22k_in1k_384

license:cc-by-nc-4.0
2,064
2

tiny_vit_5m_224.dist_in22k_ft_in1k

license:apache-2.0
2,061
1

gernet_l.idstcv_in1k

license:apache-2.0
2,040
0

pit_s_distilled_224.in1k

license:apache-2.0
2,031
0

mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k

license:apache-2.0
2,014
1

fastvit_ma36.apple_in1k

2,008
1

deit3_base_patch16_224.fb_in1k

license:apache-2.0
1,991
0

volo_d1_224.sail_in1k

license:apache-2.0
1,988
2

vit_base_patch32_clip_224.laion2b_ft_in12k_in1k

license:apache-2.0
1,981
2

botnet26t_256.c1_in1k

license:apache-2.0
1,980
0

convmixer_768_32.in1k

license:mit
1,968
2

convnext_small.in12k_ft_in1k_384

license:apache-2.0
1,950
0

resnet50.a1h_in1k

license:apache-2.0
1,945
0

vit_so400m_patch14_siglip_384.webli

license:apache-2.0
1,930
0

pvt_v2_b0.in1k

license:apache-2.0
1,918
1

tf_efficientnetv2_xl.in21k_ft_in1k

license:apache-2.0
1,916
4

fastvit_sa12.apple_dist_in1k

1,902
1

mobileone_s0.apple_in1k

1,898
1

fbnetv3_b.ra2_in1k

license:apache-2.0
1,869
0

spnasnet_100.rmsp_in1k

license:apache-2.0
1,862
0

twins_pcpvt_base.in1k

license:apache-2.0
1,859
1

caformer_s36.sail_in22k_ft_in1k_384

license:apache-2.0
1,844
2

mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k

license:apache-2.0
1,835
1

res2net50_26w_6s.in1k

1,830
0

res2net50_14w_8s.in1k

1,824
0

pnasnet5large.tf_in1k

license:apache-2.0
1,804
0

tf_efficientnetv2_b3.in21k

license:apache-2.0
1,802
2

swinv2_base_window12_192.ms_in22k

license:mit
1,795
0

regnetx_032.tv2_in1k

license:bsd-3-clause
1,795
0

tiny_vit_21m_224.dist_in22k

license:apache-2.0
1,781
0

res2net101_26w_4s.in1k

1,773
0

mobilevit_xs.cvnets_in1k

1,769
0

swin_s3_tiny_224.ms_in1k

license:mit
1,763
0

dpn107.mx_in1k

license:apache-2.0
1,763
0

resnext101_32x16d.fb_swsl_ig1b_ft_in1k

license:cc-by-nc-4.0
1,762
2

selecsls42b.in1k

license:cc-by-4.0
1,749
0

vit_base_patch16_224.augreg_in1k

license:apache-2.0
1,740
2

inception_v3.gluon_in1k

license:apache-2.0
1,738
1

tf_mixnet_l.in1k

license:apache-2.0
1,738
0

eca_botnext26ts_256.c1_in1k

license:apache-2.0
1,730
0

cait_m36_384.fb_dist_in1k

license:apache-2.0
1,728
1

coat_lite_mini.in1k

license:apache-2.0
1,728
0

nasnetalarge.tf_in1k

license:apache-2.0
1,727
0

beit_large_patch16_512.in22k_ft_in22k_in1k

license:apache-2.0
1,725
0

swin_s3_base_224.ms_in1k

license:mit
1,721
3

dla102.in1k

license:bsd-3-clause
1,707
0

eca_halonext26ts.c1_in1k

license:apache-2.0
1,699
0

sebotnet33ts_256.a1h_in1k

license:apache-2.0
1,699
0

convnext_large.fb_in22k

license:apache-2.0
1,696
0

resnetv2_101.a1h_in1k

license:apache-2.0
1,682
0

gmixer_24_224.ra3_in1k

license:apache-2.0
1,674
1

tiny_vit_5m_224.dist_in22k

license:apache-2.0
1,661
0

vit_base_r50_s16_224.orig_in21k

license:apache-2.0
1,655
0

maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k

license:apache-2.0
1,654
1

efficientvit_b0.r224_in1k

license:apache-2.0
1,652
4

convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384

license:apache-2.0
1,651
3

deit3_small_patch16_384.fb_in1k

license:apache-2.0
1,648
0

poolformer_m36.sail_in1k

license:apache-2.0
1,628
0

mobilenetv4_conv_medium.e500_r224_in1k

license:apache-2.0
1,623
1

vit_large_patch16_224.augreg_in21k

license:apache-2.0
1,615
0

efficientnet_el.ra_in1k

license:apache-2.0
1,597
0

rexnet_300.nav_in1k

license:mit
1,595
0

xcit_large_24_p8_224.fb_in1k

license:apache-2.0
1,591
1

vit_gigantic_patch14_clip_quickgelu_224.metaclip_2pt5b

license:apache-2.0
1,580
0

convnextv2_nano.fcmae_ft_in22k_in1k_384

license:cc-by-nc-4.0
1,577
0

nest_base_jx.goog_in1k

A NesT image classification model. Trained on ImageNet-1k by paper authors in JAX. Ported to PyTorch by Alexander Soare. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 67.7 - GMACs: 18.0 - Activations (M): 53.4 - Image size: 224 x 224 - Papers: - Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding: https://arxiv.org/abs/2105.12723 - Dataset: ImageNet-1k - Original: https://github.com/google-research/nested-transformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
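
A quick, hedged sanity check of the parameter count quoted above (67.7 M) against the instantiated architecture:

```python
# Minimal sketch: compare the instantiated parameter count with the card's model stats.
import timm

model = timm.create_model('nest_base_jx.goog_in1k', pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f} M parameters')   # expect roughly 67.7
```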

license:apache-2.0
1,572
0

edgenext_xx_small.in1k

license:mit
1,570
2

ViT-SO400M-14-SigLIP2

license:apache-2.0
1,565
0

vit_large_patch16_dinov3_qkvb.sat493m

A DINOv3 ViT image feature encoder. Distilled on SAT-493M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`), but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support, and appears to work as well as, if not slightly better than, the original for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and reproduce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 303.1 - GMACs: 82.4 - Activations (M): 90.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: SAT-493M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols. Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |

1,560
0

eva02_large_patch14_224.mim_in22k

license:mit
1,553
2

vgg11_bn.tv_in1k

license:bsd-3-clause
1,537
0

resnetrs50.tf_in1k

license:apache-2.0
1,526
0

caformer_s18.sail_in22k_ft_in1k_384

license:apache-2.0
1,502
0

tf_efficientnetv2_b2.in1k

license:apache-2.0
1,489
0

vit_base_patch16_clip_224.laion400m_e31

license:mit
1,470
0

densenet169.tv_in1k

license:apache-2.0
1,469
0

efficientformer_l3.snap_dist_in1k

license:apache-2.0
1,466
1

tf_mobilenetv3_large_100.in1k

license:apache-2.0
1,456
0

mobilenetv4_hybrid_medium.e500_r224_in1k

license:apache-2.0
1,449
1

seresnext26d_32x4d.bt_in1k

license:apache-2.0
1,449
0

tf_efficientnet_b0.in1k

license:apache-2.0
1,449
0

edgenext_base.usi_in1k

license:mit
1,428
0

eva02_large_patch14_clip_224.merged2b

license:mit
1,424
0

MobileCLIP2-S0-OpenCLIP

These weights and model card are adapted from the original Apple model at https://huggingface.co/apple/MobileCLIP2-S0. This version uses canonical OpenCLIP configs and weight naming. MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S0 checkpoint. `MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B. Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better avg zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet performance of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures or even OpenAI's ViT-L/14@336. | Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets | |:---|:---:|:---:|:---:|:---:|:---:| | MobileCLIP2-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 71.5 | 59.7 | | MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 | | MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 | | MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 | | MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 | | MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 | | MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 | | MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 | | MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 | | MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 | | MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 | | MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 | | MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 | | MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |
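
Since these weights use canonical OpenCLIP configs, a hedged zero-shot sketch with `open_clip` follows; the `hf-hub:` repository id and the image path are assumptions to adjust for the actual hosting location.

```python
# Minimal sketch, assuming the weights are hosted under the repo id below (adjust as needed).
from PIL import Image
import torch
import open_clip

repo = 'hf-hub:timm/MobileCLIP2-S0-OpenCLIP'              # assumed repo id
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open('img.jpg')).unsqueeze(0)    # placeholder path
text = tokenizer(['a photo of a dog', 'a photo of a cat'])
with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
print(probs)
```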

1,400
0

levit_384.fb_dist_in1k

license:apache-2.0
1,394
0

eva02_tiny_patch14_224.mim_in22k

license:mit
1,381
1

ViT-B-16-SigLIP2-384

license:apache-2.0
1,366
0

convnextv2_base.fcmae_ft_in1k

license:cc-by-nc-4.0
1,361
0

resnet26d.bt_in1k

license:apache-2.0
1,355
0

xception71.tf_in1k

license:apache-2.0
1,341
0

vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k

license:apache-2.0
1,337
4

resnet50x4_clip.openai

license:mit
1,335
0

pvt_v2_b1.in1k

license:apache-2.0
1,325
0

tf_efficientnet_b6.ns_jft_in1k

An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 43.0 - GMACs: 19.4 - Activations (M): 167.4 - Image size: 528 x 528 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.

license:apache-2.0
1,302
1

swinv2_base_window16_256.ms_in1k

license:mit
1,299
1

mobilenetv5_300m.gemma3n

1,277
2

efficientnet_em.ra2_in1k

license:apache-2.0
1,277
0

efficientformer_l1.snap_dist_in1k

license:apache-2.0
1,274
1

fastvit_sa12.apple_in1k

1,254
1

vit_large_patch14_clip_224.laion2b_ft_in12k_in1k

license:apache-2.0
1,253
0

nextvit_large.bd_ssld_6m_in1k

license:apache-2.0
1,240
0

tf_efficientnetv2_b1.in1k

license:apache-2.0
1,227
0

convit_tiny.fb_in1k

license:apache-2.0
1,225
0

mobilenetv4_hybrid_large.ix_e600_r384_in1k

license:apache-2.0
1,220
5

swinv2_large_window12to16_192to256.ms_in22k_ft_in1k

license:mit
1,212
0

convnext_xlarge.fb_in22k_ft_in1k

license:apache-2.0
1,205
0

convnext_nano.in12k

license:apache-2.0
1,191
1

swiftformer_xs.dist_in1k

A SwiftFormer image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 3.5 - GMACs: 0.6 - Activations (M): 6.4 - Image size: 224 x 224 - Dataset: ImageNet-1k - Papers: - SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications: https://arxiv.org/abs/2303.15446 - Original: https://github.com/Amshaker/SwiftFormer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
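
As a rough, hedged sketch (numbers vary by machine and are not the paper's mobile latency measurements), the 3.5 M-parameter model can be timed on CPU like this:

```python
# Minimal sketch: crude single-image CPU latency for swiftformer_xs.
import time
import torch
import timm

model = timm.create_model('swiftformer_xs.dist_in1k', pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):                    # warmup
        model(x)
    start = time.perf_counter()
    for _ in range(20):
        model(x)
elapsed_ms = (time.perf_counter() - start) / 20 * 1000
print(f'{elapsed_ms:.1f} ms / image (CPU, batch size 1)')
```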

license:apache-2.0
1,188
0

convnext_tiny.in12k_ft_in1k_384

license:apache-2.0
1,180
0

maxvit_base_tf_512.in1k

license:apache-2.0
1,178
0

convnextv2_huge.fcmae_ft_in22k_in1k_512

license:cc-by-nc-4.0
1,176
3

eva02_large_patch14_448.mim_in22k_ft_in22k_in1k

license:mit
1,171
1

vit_so150m2_patch16_reg1_gap_448.sbb_e200_in12k_ft_in1k

license:apache-2.0
1,154
1

convformer_s18.sail_in1k

license:apache-2.0
1,149
1

convnext_base.fb_in1k

license:apache-2.0
1,149
0

convnext_large.fb_in22k_ft_in1k_384

license:apache-2.0
1,146
0

pvt_v2_b5.in1k

license:apache-2.0
1,144
1

inception_next_tiny.sail_in1k

license:apache-2.0
1,141
0

beitv2_large_patch16_224.in1k_ft_in22k_in1k

license:apache-2.0
1,139
2