timm
mobilenetv3_small_100.lamb_in1k
A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- A LAMB optimizer based recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.5
  - GMACs: 0.1
  - Activations (M): 1.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
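A minimal inference sketch with the `timm` API, assuming a local image file (`example.jpg` is a placeholder, not part of the card):

```python
import timm
import torch
from PIL import Image

# Load the pretrained classifier and the preprocessing that matches its config.
model = timm.create_model('mobilenetv3_small_100.lamb_in1k', pretrained=True).eval()
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
with torch.inference_mode():
    logits = model(transform(img).unsqueeze(0))  # (1, 1000) ImageNet-1k logits

top5_prob, top5_idx = logits.softmax(dim=-1).topk(5)
print(top5_idx, top5_prob)
```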
resnet50.a1_in1k
This model features: ReLU activations single layer 7x7 convolution with pooling 1x1 convolution shortcut downsample Trained on ImageNet-1k in `timm` using recipe template described below. Recipe details: ResNet Strikes Back `A1` recipe LAMB optimizer with BCE loss Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 25.6 - GMACs: 4.1 - Activations (M): 11.1 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476 - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385 - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |imgsize|top1 |top5 |paramcount|gmacs|macts|img/sec| |------------------------------------------|--------|-----|-----|-----------|-----|-----|-------| |seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 | |seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 | |seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 | |seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 | |resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 | |resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 | |ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 | |ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 | |resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 | |seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 | |resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 | |resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 | |seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 | |seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 | |resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 | |resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 | |resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 | |seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 | |resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 | |resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 | |resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 | |resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 | |seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 | |resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 | |seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 | |seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 | |resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 | |resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 | |seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 | |resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 | |resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 | |resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 | |resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 | |resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 | |resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 | |resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 | |resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 | |ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 | 
|resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 | |resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 | |resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 | |resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 | |resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 | |resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 | |resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 | |resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 | |resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 | |wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 | |resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 | |resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 | |ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 | |resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 | |resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 | |resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 | |wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 | |resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 | |resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 | |seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 | |ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 | |resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 | |ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 | |resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 | |ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 | |resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 | |resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 | |resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 | |resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 | |resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 | |resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 | |resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 | |resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 | |ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 | |ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 | |resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 | |wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 | |resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 | |resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 | |resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 | |gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 | |wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 | |resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 | |ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 | |ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 | |resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 | |resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 | |seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 | |resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 | |seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 | |ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 | |gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 | |senet154.gluonin1k|224 |81.23|95.35|115.1 |20.8 |38.7 |545 | |resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 
|18.4 |2089 | |resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 | |resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 | |resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 | |resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 | |resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 | |seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 | |seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 | |resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 | |resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 | |gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 | |resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 | |resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 | |seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 | |seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 | |resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 | |resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 | |ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 | |ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 | |seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 | |ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 | |resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 | |resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 | |resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 | |resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 | |resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 | |ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 | |resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 | |resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 | |resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 | |ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 | |resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 | |gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 | |gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 | |resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 | |resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 | |resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 | |wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 | |resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 | |resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 | |ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 | |resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 | |wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 | |resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 | |resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 | |seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 | |resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 | |resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 | |resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 | |seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 | |resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 | |resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 | |resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 | |seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 | |ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 | |gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 |11.7 |3216 | |resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 | 
|seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 | |resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 | |resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 | |seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 | |resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 | |resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 | |resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 | |resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 | |ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 | |resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 | |resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 | |resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 | |resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 | |resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 | |ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 | |resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 | |resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 | |resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 | |resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 | |ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 | |resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 | |resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 | |resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 | |resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 | |resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 | |resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 | |resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 | |resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 | |resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 | |resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 | |resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 | |resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 | |resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 | |resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 | |resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 | |resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 | |resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 | |resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 | |ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 | |resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 | |wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 | |seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 | |resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 | |seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 | |resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 | |resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 | |wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 | |resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 | |resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 | |gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 | |resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 | |resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 | |seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 | |batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 | |resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 | |resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 | |ecaresnext26ts.chin1k|288 |78.0 |93.91|10.3 |3.1 |13.3 |3297 | |seresnext26t32x4d.btin1k|224 
|77.98|93.75|16.8 |2.7 |10.1 |3841 | |resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 | |resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 | |resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 | |seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 | |resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 | |gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 | |ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 | |resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 | |resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 | |seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 | |resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 | |ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 | |resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 | |resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 | |resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 | |resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 | |resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 | |resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 | |seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 | |resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 | |resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 | |resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 | |resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 | |resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 | |resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 | |resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 | |resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 | |resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 | |resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 | |seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 | |resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 | |resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 | |resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 | |resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 | |resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 | |resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 | |resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 | |resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 | |resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 | |resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 | |resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 | |resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 | |resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 | |resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 | |resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 | |resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 | |resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 | |resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 | |resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 | |resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
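The `resnet50.a1_in1k` card above lists train = 224 x 224 and test = 288 x 288; a sketch of building eval preprocessing at the larger test resolution with `timm` (the `use_test_size` argument is assumed to be available in your timm version):

```python
import timm
import torch

model = timm.create_model('resnet50.a1_in1k', pretrained=True).eval()

# Resolve preprocessing at the 288px test resolution rather than the 224px
# train resolution; `use_test_size` is assumed present in recent timm releases.
data_config = timm.data.resolve_model_data_config(model, use_test_size=True)
transform = timm.data.create_transform(**data_config, is_training=False)

x = torch.randn(1, *data_config['input_size'])  # stand-in for a preprocessed image
with torch.inference_mode():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```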
resnet18.a1_in1k
--- tags: - image-classification - timm - transformers license: apache-2.0 library_name: timm ---
convnextv2_nano.fcmae_ft_in22k_in1k
--- license: cc-by-nc-4.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-22k ---
efficientnet_b0.ra_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k ---
ViT-B-16-SigLIP2-512
--- tags: - siglip - siglip2 - vision library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---
vit_small_patch16_224.augreg_in21k_ft_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---
resnet34.a1_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---
vit_base_patch16_plus_clip_240.laion400m_e31
--- tags: - clip library_name: open_clip pipeline_tag: zero-shot-image-classification license: mit ---
ViT-B-16-SigLIP-i18n-256
--- tags: - clip - siglip library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---
vit_tiny_patch16_224.augreg_in21k_ft_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---
resnet50.ram_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---
vit_base_patch16_224.augreg2_in21k_ft_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---
convnext_tiny.in12k_ft_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-12k ---
convnext_femto.d1_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---
vit_base_patch16_224.dino
--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---
wide_resnet50_2.racm_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers ---
convnext_large.fb_in22k_ft_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k - imagenet-22k ---
vgg19.tv_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: bsd-3-clause datasets: - imagenet-1k ---
mobilenetv3_large_100.ra_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k ---
rexnet_150.nav_in1k
--- license: mit library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---
vit_base_patch14_dinov2.lvd142m
--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---
regnety_032.ra_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---
samvit_base_patch16.sa1b
--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---
ViT-SO400M-16-SigLIP2-384
--- tags: - siglip - siglip2 - vision library_name: open_clip pipeline_tag: zero-shot-image-classification license: apache-2.0 datasets: - webli ---
resnet50.fb_swsl_ig1b_ft_in1k
--- license: cc-by-nc-4.0 library_name: timm tags: - image-classification - timm - transformers ---
vit_small_patch14_reg4_dinov2.lvd142m
--- license: apache-2.0 library_name: timm tags: - image-feature-extraction - timm - transformers ---
deit_small_patch16_224.fb_in1k
--- license: apache-2.0 library_name: timm tags: - image-classification - timm - transformers datasets: - imagenet-1k ---
vit_base_patch16_224.augreg_in21k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-21k ---
vit_base_patch32_384.augreg_in21k_ft_in1k
--- tags: - image-classification - timm - transformers library_name: timm license: apache-2.0 datasets: - imagenet-1k - imagenet-21k ---
edgenext_small.usi_in1k
An EdgeNeXt image classification model. Trained on ImageNet-1k by paper authors using distillation (`USI` as per `Solving ImageNet`).

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.6
  - GMACs: 1.3
  - Activations (M): 9.1
  - Image size: train = 256 x 256, test = 320 x 320
- Papers:
  - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications: https://arxiv.org/abs/2206.10589
  - Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results: https://arxiv.org/abs/2204.03475
- Dataset: ImageNet-1k
- Original: https://github.com/mmaaz60/EdgeNeXt
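Since the card lists the model as a feature backbone, a sketch of pulling multi-scale feature maps via timm's `features_only` interface (support for this architecture and the exact output shapes are assumptions to verify in your timm version):

```python
import timm
import torch

# features_only returns intermediate feature maps instead of classification logits.
model = timm.create_model('edgenext_small.usi_in1k', pretrained=True, features_only=True).eval()

x = torch.randn(1, 3, 256, 256)  # train resolution from the card
with torch.inference_mode():
    feature_maps = model(x)

for fmap in feature_maps:
    print(tuple(fmap.shape))  # (1, C, H, W) at successively coarser strides
```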
convnext_base.clip_laion2b_augreg_ft_in12k_in1k
vit_large_patch14_dinov2.lvd142m
A Vision Transformer (ViT) image feature model. Pretrained on LVD-142M with self-supervised DINOv2 method.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.4
  - GMACs: 507.1
  - Activations (M): 1058.8
  - Image size: 518 x 518
- Papers:
  - DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Original: https://github.com/facebookresearch/dinov2
- Pretrain Dataset: LVD-142M

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
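As an image feature model, it is typically used to produce embeddings rather than class logits; a minimal timm sketch (the random tensor stands in for a preprocessed 518 x 518 image):

```python
import timm
import torch

# num_classes=0 drops the head so the model returns a pooled image embedding.
model = timm.create_model('vit_large_patch14_dinov2.lvd142m', pretrained=True, num_classes=0).eval()

x = torch.randn(1, 3, 518, 518)  # stand-in for a preprocessed image
with torch.inference_mode():
    embedding = model(x)                # (1, 1024) pooled embedding
    tokens = model.forward_features(x)  # (1, 1370, 1024): class token + 37x37 patch tokens
```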
deit_tiny_patch16_224.fb_in1k
ViT-B-16-SigLIP-256
convnext_base.fb_in22k_ft_in1k
vit_base_patch16_dinov3.lvd1689m
convnextv2_tiny.fcmae_ft_in1k
vit_base_patch16_clip_224.openai
Model Details The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. This instance of the CLIP model is intended for loading in `timm` (https://github.com/rwightman/pytorch-image-models) and `OpenCLIP` (https://github.com/mlfoundations/openclip) libraries. Please see https://huggingface.co/openai/clip-vit-base-patch16 for use in Hugging Face Transformers. Model Type The model uses a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. Intended Use The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Primary intended uses The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Out-of-Scope Use Cases Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Data The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. 
Data Mission Statement Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. Limitations CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. Bias and Fairness We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks.
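A zero-shot classification sketch with OpenCLIP; loading via the generic `'ViT-B-16'` / `'openai'` names is an assumption about how the same OpenAI weights are exposed in `open_clip`, and the prompts and image path are placeholders:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)   # placeholder image
text = tokenizer(['a photo of a cat', 'a photo of a dog'])   # placeholder prompts

with torch.inference_mode():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over the prompts gives zero-shot class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```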
convnext_nano.in12k_ft_in1k
mobilenetv3_small_075.lamb_in1k
A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- A LAMB optimizer recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.0
  - GMACs: 0.0
  - Activations (M): 1.3
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
tf_mobilenetv3_large_minimal_100.in1k
A MobileNet-v3 image classification model. Trained on ImageNet-1k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 3.9
  - GMACs: 0.2
  - Activations (M): 4.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
vit_large_patch14_reg4_dinov2.lvd142m
A Vision Transformer (ViT) image feature model with registers. Pretrained on LVD-142M with self-supervised DINOv2 method.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.4
  - GMACs: 416.1
  - Activations (M): 305.3
  - Image size: 518 x 518
- Papers:
  - Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588
  - DINOv2: Learning Robust Visual Features without Supervision: https://arxiv.org/abs/2304.07193
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Original: https://github.com/facebookresearch/dinov2
- Pretrain Dataset: LVD-142M

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
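With registers, the token sequence carries class and register tokens ahead of the patch tokens; a sketch of slicing them apart in timm (the `num_prefix_tokens` attribute and the count of 1 class + 4 register tokens are assumptions based on the model name):

```python
import timm
import torch

model = timm.create_model('vit_large_patch14_reg4_dinov2.lvd142m', pretrained=True, num_classes=0).eval()

x = torch.randn(1, 3, 518, 518)  # stand-in for a preprocessed image
with torch.inference_mode():
    tokens = model.forward_features(x)  # (1, prefix + 1369, 1024)

# Assumed prefix: 1 class token + 4 register tokens; the rest are patch tokens
# on a 37x37 grid.
n_prefix = getattr(model, 'num_prefix_tokens', 5)
patch_tokens = tokens[:, n_prefix:]
print(patch_tokens.shape)  # torch.Size([1, 1369, 1024])
```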
inception_v4.tf_in1k
inception_v3.tv_in1k
An Inception-v3 image classification model. Trained on ImageNet-1k, torchvision weights.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 23.8
  - GMACs: 5.7
  - Activations (M): 9.0
  - Image size: 299 x 299
- Papers:
  - Rethinking the Inception Architecture for Computer Vision: https://arxiv.org/abs/1512.00567
- Original: https://github.com/pytorch/vision
- Dataset: ImageNet-1k

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
beitv2_base_patch16_224.in1k_ft_in22k
tf_efficientnetv2_s.in21k_ft_in1k
mobilenetv3_large_100.miil_in21k_ft_in1k
A MobileNet-v3 image classification model. Pretrained on ImageNet-21k-P and fine-tuned on ImageNet-1k by Alibaba MIIL.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.5
  - GMACs: 0.2
  - Activations (M): 4.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k-P

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
tf_mobilenetv3_small_minimal_100.in1k
vit_base_patch8_224.augreg2_in21k_ft_in1k
vit_small_patch14_dinov2.lvd142m
eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
swin_base_patch4_window7_224.ms_in22k_ft_in1k
vit_small_patch16_dinov3.lvd1689m
swin_tiny_patch4_window7_224.ms_in1k
vit_small_patch16_224.dino
tf_efficientnetv2_m.in21k_ft_in1k
An EfficientNet-v2 image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 54.1
  - GMACs: 15.9
  - Activations (M): 57.5
  - Image size: train = 384 x 384, test = 480 x 480
- Papers:
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
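A sketch of separating the backbone pass from the classifier head with timm's `forward_features` / `forward_head` split, e.g. to reuse the pooled features for another task (the random tensor stands in for a preprocessed image):

```python
import timm
import torch

model = timm.create_model('tf_efficientnetv2_m.in21k_ft_in1k', pretrained=True).eval()

x = torch.randn(1, 3, 384, 384)  # train resolution from the card
with torch.inference_mode():
    features = model.forward_features(x)                        # unpooled feature map (B, C, H, W)
    pre_logits = model.forward_head(features, pre_logits=True)  # pooled, pre-classifier features
    logits = model.forward_head(features)                       # (1, 1000) ImageNet-1k logits
```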
vit_so400m_patch16_siglip_256.v2_webli
resnet101.tv_in1k
resnet18.fb_swsl_ig1b_ft_in1k
inception_v3.tf_adv_in1k
An Inception-v3 image classification model. Adversarially trained on ImageNet-1k by paper authors. Ported from Tensorflow by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 23.8
  - GMACs: 5.7
  - Activations (M): 9.0
  - Image size: 299 x 299
- Papers:
  - Rethinking the Inception Architecture for Computer Vision: https://arxiv.org/abs/1512.00567
  - Adversarial Attacks and Defences Competition: https://arxiv.org/abs/1804.00097
- Original: https://github.com/tensorflow/models
- Dataset: ImageNet-1k

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
vit_large_patch14_clip_336.openai
Model Details The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. This instance of the CLIP model is intended for loading in `timm` (https://github.com/rwightman/pytorch-image-models) and `OpenCLIP` (https://github.com/mlfoundations/openclip) libraries. Please see https://huggingface.co/openai/clip-vit-large-patch14-336 for use in Hugging Face Transformers. Model Type The model uses a ViT-L/14 (336x336) Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. Intended Use The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Primary intended uses The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Out-of-Scope Use Cases Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Data The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. 
Data Mission Statement Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. Limitations CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. Bias and Fairness We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks.
vit_base_patch16_clip_224.laion400m_e32
mobilenetv3_small_050.lamb_in1k
convnext_small.dinov3_lvd1689m
tf_efficientnetv2_s.in21k
An EfficientNet-v2 image classification model. Trained on ImageNet-21k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 48.2
  - GMACs: 5.4
  - Activations (M): 22.8
  - Image size: train = 300 x 300, test = 384 x 384
- Papers:
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
- Dataset: ImageNet-21k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
ViT-B-16-SigLIP2-256
A SigLIP 2 Vision-Language model trained on WebLI. This model has been converted for use in OpenCLIP from the original JAX checkpoints in Big Vision.

Model Details
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Original: https://github.com/google-research/big_vision
- Dataset: WebLI
- Papers:
  - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786
  - Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343
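A zero-shot classification sketch with OpenCLIP; the hub path `hf-hub:timm/ViT-B-16-SigLIP2-256` is an assumption about where this checkpoint is published, and the labels and image path are placeholders:

```python
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub path is an assumption; adjust to wherever the checkpoint is hosted.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP2-256')
tokenizer = get_tokenizer('hf-hub:timm/ViT-B-16-SigLIP2-256')
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # placeholder image
text = tokenizer(['a cat', 'a dog', 'a bird'])              # placeholder labels

with torch.inference_mode():
    image_features = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    text_features = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
    # SigLIP scores each image-text pair independently with a sigmoid,
    # rather than a softmax across the label set.
    probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)
print(probs)
```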
vit_large_patch14_clip_224.metaclip_2pt5b
Model card for vit_large_patch14_clip_224.metaclip_2pt5b

Model Usage
This model is a dual use `open_clip` and `timm` model. The model name in OpenCLIP is `ViT-L-14-quickgelu`, and the timm name is `vit_large_patch14_clip_224.metaclip_2pt5b`.
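A sketch of the dual-use loading paths; the `hf-hub:timm/...` repo path is an assumption about where these weights are hosted, while the timm model name comes from the card above:

```python
import open_clip
import timm
import torch

# OpenCLIP: load the full image+text model from the hub repo (hub path assumed).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:timm/vit_large_patch14_clip_224.metaclip_2pt5b')
tokenizer = open_clip.get_tokenizer('hf-hub:timm/vit_large_patch14_clip_224.metaclip_2pt5b')

# timm: load just the image tower under its timm name, as an embedding model.
image_tower = timm.create_model('vit_large_patch14_clip_224.metaclip_2pt5b',
                                pretrained=True, num_classes=0).eval()

with torch.inference_mode():
    emb = image_tower(torch.randn(1, 3, 224, 224))  # (1, 1024) pooled image embedding
```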
vit_large_patch16_dinov3.lvd1689m
vit_tiny_patch16_224.augreg_in21k
A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 9.7
  - GMACs: 1.1
  - Activations (M): 4.1
  - Image size: 224 x 224
- Papers:
  - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Dataset: ImageNet-21k
- Original: https://github.com/google-research/vision_transformer

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
swin_base_patch4_window12_384.ms_in22k_ft_in1k
ViT-SO400M-14-SigLIP
mobilenetv2_100.ra_in1k
A MobileNet-v2 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- RandAugment `RA` recipe. Inspired by and evolved from EfficientNet RandAugment recipes. Published as `B` recipe in ResNet Strikes Back.
- RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 3.5
  - GMACs: 0.3
  - Activations (M): 6.7
  - Image size: 224 x 224
- Papers:
  - MobileNetV2: Inverted Residuals and Linear Bottlenecks: https://arxiv.org/abs/1801.04381
  - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
efficientnet_b3.ra2_in1k
An EfficientNet image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- RandAugment `RA2` recipe. Inspired by and evolved from EfficientNet RandAugment recipes. Published as `B` recipe in ResNet Strikes Back.
- RMSProp (TF 1.0 behaviour) optimizer, EMA weight averaging
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 12.2
  - GMACs: 1.6
  - Activations (M): 21.5
  - Image size: train = 288 x 288, test = 320 x 320
- Papers:
  - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946
  - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
deit_base_distilled_patch16_384.fb_in1k
twins_svt_large.in1k
A Twins-SVT image classification model. Trained on ImageNet-1k by paper authors.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 99.3
  - GMACs: 15.1
  - Activations (M): 35.1
  - Image size: 224 x 224
- Papers:
  - Twins: Revisiting the Design of Spatial Attention in Vision Transformers: https://arxiv.org/abs/2104.13840
- Dataset: ImageNet-1k
- Original: https://github.com/Meituan-AutoML/Twins

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
vit_base_patch32_clip_224.laion400m_e32
densenet121.ra_in1k
vit_small_plus_patch16_dinov3.lvd1689m
tf_efficientnet_b0.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 5.3
  - GMACs: 0.4
  - Activations (M): 6.7
  - Image size: 224 x 224
- Papers:
  - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946
  - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252
- Dataset: ImageNet-1k
- Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
vit_large_patch16_384.augreg_in21k_ft_in1k
Model card for vit_large_patch16_384.augreg_in21k_ft_in1k

A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman.

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 304.7
  - GMACs: 174.8
  - Activations (M): 128.2
  - Image size: 384 x 384
- Papers:
  - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- Dataset: ImageNet-1k
- Pretrain Dataset: ImageNet-21k
- Original: https://github.com/google-research/vision_transformer

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.
convnext_xxlarge.clip_laion2b_soup_ft_in1k
inception_resnet_v2.tf_in1k
ViT-SO400M-14-SigLIP-384
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI. This model has been converted to PyTorch from the original JAX checkpoints in Big Vision. These weights are usable in both OpenCLIP (image + text) and timm (image only).

Model Details
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Original: https://github.com/google-research/big_vision
- Dataset: WebLI
- Papers:
  - Sigmoid loss for language image pre-training: https://arxiv.org/abs/2303.15343
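For the timm (image only) path mentioned above, a sketch of loading just the image tower as an embedding model; the timm model name `vit_so400m_patch14_siglip_384.webli` is an assumption and can be checked with `timm.list_models('*siglip*')`:

```python
import timm
import torch

# Assumed timm name for the image tower of these SigLIP weights.
model = timm.create_model('vit_so400m_patch14_siglip_384.webli', pretrained=True, num_classes=0).eval()

x = torch.randn(1, 3, 384, 384)  # stand-in for a preprocessed image
with torch.inference_mode():
    embedding = model(x)   # pooled image embedding
print(embedding.shape)     # e.g. torch.Size([1, 1152]) for the SO400M width
```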
resnet101.a1h_in1k
This model features: ReLU activations single layer 7x7 convolution with pooling 1x1 convolution shortcut downsample Trained on ImageNet-1k in `timm` using recipe template described below. Recipe details: Based on ResNet Strikes Back `A1` recipe LAMB optimizer Stronger dropout, stochastic depth, and RandAugment than paper `A1` recipe Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 44.5 - GMACs: 7.8 - Activations (M): 16.2 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476 - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385 - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |imgsize|top1 |top5 |paramcount|gmacs|macts|img/sec| |------------------------------------------|--------|-----|-----|-----------|-----|-----|-------| |seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 | |seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 | |seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 | |seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 | |resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 | |resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 | |ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 | |ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 | |resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 | |seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 | |resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 | |resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 | |seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 | |seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 | |resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 | |resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 | |resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 | |seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 | |resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 | |resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 | |resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 | |resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 | |seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 | |resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 | |seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 | |seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 | |resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 | |resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 | |seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 | |resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 | |resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 | |resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 | |resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 | |resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 | |resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 | |resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 | |resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 | 
|ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 | |resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 | |resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 | |resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 | |resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 | |resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 | |resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 | |resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 | |resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 | |resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 | |wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 | |resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 | |resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 | |ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 | |resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 | |resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 | |resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 | |wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 | |resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 | |resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 | |seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 | |ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 | |resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 | |ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 | |resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 | |ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 | |resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 | |resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 | |resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 | |resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 | |resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 | |resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 | |resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 | |resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 | |resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 | |ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 | |ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 | |resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 | |wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 | |resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 | |resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 | |resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 | |gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 | |wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 | |resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 | |ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 | |ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 | |resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 | |resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 | |seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 | |resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 | |seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 | |ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 | |gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 | |senet154.gluonin1k|224 
|81.23|95.35|115.1 |20.8 |38.7 |545 | |resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 |18.4 |2089 | |resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 | |resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 | |resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 | |resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 | |resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 | |seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 | |seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 | |resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 | |resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 | |gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 | |resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 | |resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 | |seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 | |seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 | |resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 | |resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 | |ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 | |ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 | |seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 | |ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 | |resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 | |resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 | |resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 | |resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 | |resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 | |ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 | |resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 | |resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 | |resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 | |ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 | |resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 | |gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 | |gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 | |resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 | |resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 | |resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 | |wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 | |resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 | |resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 | |ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 | |resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 | |wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 | |resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 | |resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 | |seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 | |resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 | |resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 | |resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 | |seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 | |resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 | |resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 | |resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 | |seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 | |ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 | |gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 
|11.7 |3216 | |resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 | |seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 | |resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 | |resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 | |seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 | |resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 | |resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 | |resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 | |resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 | |ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 | |resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 | |resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 | |resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 | |resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 | |resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 | |ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 | |resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 | |resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 | |resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 | |resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 | |ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 | |resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 | |resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 | |resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 | |resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 | |resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 | |resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 | |resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 | |resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 | |resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 | |resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 | |resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 | |resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 | |resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 | |resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 | |resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 | |resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 | |resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 | |resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 | |ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 | |resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 | |wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 | |seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 | |resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 | |seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 | |resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 | |resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 | |wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 | |resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 | |resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 | |gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 | |resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 | |resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 | |seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 | |batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 | |resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 | |resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 | |ecaresnext26ts.chin1k|288 
|78.0 |93.91|10.3 |3.1 |13.3 |3297 | |seresnext26t32x4d.btin1k|224 |77.98|93.75|16.8 |2.7 |10.1 |3841 | |resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 | |resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 | |resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 | |seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 | |resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 | |gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 | |ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 | |resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 | |resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 | |seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 | |resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 | |ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 | |resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 | |resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 | |resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 | |resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 | |resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 | |resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 | |seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 | |resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 | |resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 | |resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 | |resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 | |resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 | |resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 | |resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 | |resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 | |resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 | |resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 | |seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 | |resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 | |resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 | |resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 | |resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 | |resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 | |resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 | |resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 | |resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 | |resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 | |resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 | |resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 | |resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 | |resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 | |resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 | |resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 | |resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 | |resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 | |resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 | |resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 | |resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
deit_base_distilled_patch16_224.fb_in1k
vit_large_patch16_224.augreg_in21k_ft_in1k
convnext_tiny.dinov3_lvd1689m
swin_base_patch4_window12_384.ms_in22k
resnet50d.ra2_in1k
efficientnet_b4.ra2_in1k
convnext_tiny.in12k
A ConvNeXt image classification model. Trained in `timm` on ImageNet-12k (a 11821 class subset of full ImageNet-22k) by Ross Wightman. ImageNet-12k training done on TPUs thanks to support of the TRC program. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 36.9 - GMACs: 4.5 - Activations (M): 13.4 - Image size: 224 x 224 - Papers: - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - Original: https://github.com/huggingface/pytorch-image-models - Dataset: ImageNet-12k Model Comparison Explore the dataset and runtime metrics of this model in timm model results. All timing numbers from eager model PyTorch 1.13 on RTX 3090 w/ AMP. | model |top1 |top5 |imgsize|paramcount|gmacs |macts |samplespersec|batchsize| |------------------------------------------------------------------------------------------------------------------------------|------|------|--------|-----------|------|------|---------------|----------| | convnextv2huge.fcmaeftin22kin1k512 |88.848|98.742|512 |660.29 |600.81|413.07|28.58 |48 | | convnextv2huge.fcmaeftin22kin1k384 |88.668|98.738|384 |660.29 |337.96|232.35|50.56 |64 | | convnextxxlarge.cliplaion2bsoupftin1k |88.612|98.704|256 |846.47 |198.09|124.45|122.45 |256 | | convnextlargemlp.cliplaion2bsoupftin12kin1k384 |88.312|98.578|384 |200.13 |101.11|126.74|196.84 |256 | | convnextv2large.fcmaeftin22kin1k384 |88.196|98.532|384 |197.96 |101.1 |126.74|128.94 |128 | | convnextlargemlp.cliplaion2bsoupftin12kin1k320 |87.968|98.47 |320 |200.13 |70.21 |88.02 |283.42 |256 | | convnextxlarge.fbin22kftin1k384 |87.75 |98.556|384 |350.2 |179.2 |168.99|124.85 |192 | | convnextv2base.fcmaeftin22kin1k384 |87.646|98.422|384 |88.72 |45.21 |84.49 |209.51 |256 | | convnextlarge.fbin22kftin1k384 |87.476|98.382|384 |197.77 |101.1 |126.74|194.66 |256 | | convnextlargemlp.cliplaion2baugregftin1k |87.344|98.218|256 |200.13 |44.94 |56.33 |438.08 |256 | | convnextv2large.fcmaeftin22kin1k |87.26 |98.248|224 |197.96 |34.4 |43.13 |376.84 |256 | | convnextbase.cliplaion2baugregftin12kin1k384 |87.138|98.212|384 |88.59 |45.21 |84.49 |365.47 |256 | | convnextxlarge.fbin22kftin1k |87.002|98.208|224 |350.2 |60.98 |57.5 |368.01 |256 | | convnextbase.fbin22kftin1k384 |86.796|98.264|384 |88.59 |45.21 |84.49 |366.54 |256 | | convnextv2base.fcmaeftin22kin1k |86.74 |98.022|224 |88.72 |15.38 |28.75 |624.23 |256 | | convnextlarge.fbin22kftin1k |86.636|98.028|224 |197.77 |34.4 |43.13 |581.43 |256 | | convnextbase.cliplaionaaugregftin1k384 |86.504|97.97 |384 |88.59 |45.21 |84.49 |368.14 |256 | | convnextbase.cliplaion2baugregftin12kin1k |86.344|97.97 |256 |88.59 |20.09 |37.55 |816.14 |256 | | convnextv2huge.fcmaeftin1k |86.256|97.75 |224 |660.29 |115.0 |79.07 |154.72 |256 | | convnextsmall.in12kftin1k384 |86.182|97.92 |384 |50.22 |25.58 |63.37 |516.19 |256 | | convnextbase.cliplaion2baugregftin1k |86.154|97.68 |256 |88.59 |20.09 |37.55 |819.86 |256 | | convnextbase.fbin22kftin1k |85.822|97.866|224 |88.59 |15.38 |28.75 |1037.66 |256 | | convnextsmall.fbin22kftin1k384 |85.778|97.886|384 |50.22 |25.58 |63.37 |518.95 |256 | | convnextv2large.fcmaeftin1k |85.742|97.584|224 |197.96 |34.4 |43.13 |375.23 |256 | | convnextsmall.in12kftin1k |85.174|97.506|224 |50.22 |8.71 |21.56 |1474.31 |256 | | convnexttiny.in12kftin1k384 |85.118|97.608|384 |28.59 |13.14 |39.48 |856.76 |256 | | convnextv2tiny.fcmaeftin22kin1k384 |85.112|97.63 |384 |28.64 |13.14 |39.48 |491.32 |256 | | convnextv2base.fcmaeftin1k |84.874|97.09 |224 |88.72 |15.38 |28.75 |625.33 |256 | | 
convnextsmall.fbin22kftin1k |84.562|97.394|224 |50.22 |8.71 |21.56 |1478.29 |256 | | convnextlarge.fbin1k |84.282|96.892|224 |197.77 |34.4 |43.13 |584.28 |256 | | convnexttiny.in12kftin1k |84.186|97.124|224 |28.59 |4.47 |13.44 |2433.7 |256 | | convnexttiny.fbin22kftin1k384 |84.084|97.14 |384 |28.59 |13.14 |39.48 |862.95 |256 | | convnextv2tiny.fcmaeftin22kin1k |83.894|96.964|224 |28.64 |4.47 |13.44 |1452.72 |256 | | convnextbase.fbin1k |83.82 |96.746|224 |88.59 |15.38 |28.75 |1054.0 |256 | | convnextv2nano.fcmaeftin22kin1k384 |83.37 |96.742|384 |15.62 |7.22 |24.61 |801.72 |256 | | convnextsmall.fbin1k |83.142|96.434|224 |50.22 |8.71 |21.56 |1464.0 |256 | | convnextv2tiny.fcmaeftin1k |82.92 |96.284|224 |28.64 |4.47 |13.44 |1425.62 |256 | | convnexttiny.fbin22kftin1k |82.898|96.616|224 |28.59 |4.47 |13.44 |2480.88 |256 | | convnextnano.in12kftin1k |82.282|96.344|224 |15.59 |2.46 |8.37 |3926.52 |256 | | convnexttinyhnf.a2hin1k |82.216|95.852|224 |28.59 |4.47 |13.44 |2529.75 |256 | | convnexttiny.fbin1k |82.066|95.854|224 |28.59 |4.47 |13.44 |2346.26 |256 | | convnextv2nano.fcmaeftin22kin1k |82.03 |96.166|224 |15.62 |2.46 |8.37 |2300.18 |256 | | convnextv2nano.fcmaeftin1k |81.83 |95.738|224 |15.62 |2.46 |8.37 |2321.48 |256 | | convnextnanools.d1hin1k |80.866|95.246|224 |15.65 |2.65 |9.38 |3523.85 |256 | | convnextnano.d1hin1k |80.768|95.334|224 |15.59 |2.46 |8.37 |3915.58 |256 | | convnextv2pico.fcmaeftin1k |80.304|95.072|224 |9.07 |1.37 |6.1 |3274.57 |256 | | convnextpico.d1in1k |79.526|94.558|224 |9.05 |1.37 |6.1 |5686.88 |256 | | convnextpicools.d1in1k |79.522|94.692|224 |9.06 |1.43 |6.5 |5422.46 |256 | | convnextv2femto.fcmaeftin1k |78.488|93.98 |224 |5.23 |0.79 |4.57 |4264.2 |256 | | convnextfemtools.d1in1k |77.86 |93.83 |224 |5.23 |0.82 |4.87 |6910.6 |256 | | convnextfemto.d1in1k |77.454|93.68 |224 |5.22 |0.79 |4.57 |7189.92 |256 | | convnextv2atto.fcmaeftin1k |76.664|93.044|224 |3.71 |0.55 |3.81 |4728.91 |256 | | convnextattools.a2in1k |75.88 |92.846|224 |3.7 |0.58 |4.11 |7963.16 |256 | | convnextatto.d2in1k |75.664|92.9 |224 |3.7 |0.55 |3.81 |8439.22 |256 |
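As a minimal, hedged inference sketch for the model above (it uses the standard `timm` helpers `create_model`, `resolve_model_data_config`, and `create_transform`; the image path is a placeholder):

```python
from PIL import Image
import timm
import torch

# Minimal inference sketch; 'your_image.jpg' is a placeholder for any RGB image.
model = timm.create_model('convnext_tiny.in12k', pretrained=True).eval()

# Build the eval-time transform matching this checkpoint's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('your_image.jpg').convert('RGB')
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # (1, 11821) for the ImageNet-12k head

top5_prob, top5_idx = logits.softmax(dim=-1).topk(5)
print(top5_idx, top5_prob)
```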
maxvit_nano_rw_256.sw_in1k
A timm specific MaxViT image classification model. Trained in `timm` on ImageNet-1k by Ross Wightman. ImageNet-1k training done on TPUs thanks to support of the TRC program. MaxxViT covers a number of related model architectures that share a common structure including: - CoAtNet - Combining MBConv (depthwise-separable) convolutional blocks in early stages with self-attention transformer blocks in later stages. - MaxViT - Uniform blocks across all stages, each containing a MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid). - CoAtNeXt - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT-V2 - A MaxxViT variation that removes the window block attention leaving only ConvNeXt blocks and grid attention w/ more width to compensate. Aside from the major variants listed above, there are more subtle changes from model to model. Any model name with the string `rw` are `timm` specific configs w/ modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models so there are variations. All models with the string `tf` are models exactly matching Tensorflow based models by the original paper authors with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 15.5 - GMACs: 4.5 - Activations (M): 30.3 - Image size: 256 x 256 - Papers: - MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697 - Dataset: ImageNet-1k Model Comparison By Top-1 |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 
257.59| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| By Throughput (samples / sec) |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| 
|maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22|
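Since the card lists the model as a feature backbone, a hedged sketch of multi-scale feature extraction with `timm`'s standard `features_only=True` option (the random tensor below is a stand-in for a real image batch):

```python
import timm
import torch

# Feature-backbone sketch: return one feature map per stage instead of logits.
model = timm.create_model(
    'maxvit_nano_rw_256.sw_in1k',
    pretrained=True,
    features_only=True,
).eval()

x = torch.randn(1, 3, 256, 256)  # 256 x 256 matches the card's image size
with torch.no_grad():
    feature_maps = model(x)

for fm in feature_maps:
    print(fm.shape)  # (B, C, H, W) at decreasing resolution per stage
```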
resnet50_clip.openai
eva02_large_patch14_clip_224.merged2b_s4b_b131k
vit_base_patch32_224.augreg_in21k
A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 104.3 - GMACs: 4.4 - Activations (M): 4.2 - Image size: 224 x 224 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
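This checkpoint carries an ImageNet-21k classifier head; a hedged sketch of using it instead as an embedding extractor via `timm`'s `num_classes=0` convention:

```python
import timm
import torch

# Embedding-extractor sketch: drop the 21k-class head via num_classes=0
# and use the pooled features directly.
model = timm.create_model(
    'vit_base_patch32_224.augreg_in21k',
    pretrained=True,
    num_classes=0,
).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(x)  # (1, 768) pooled embedding for ViT-B
print(embedding.shape)
```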
vit_base_patch16_clip_224.openai_ft_in12k_in1k
efficientnet_b5.sw_in12k_ft_in1k
An EfficientNet image classification model. Pretrained on ImageNet-12k and fine-tuned on ImageNet-1k by Ross Wightman in `timm` using the recipe template described below. Recipe details: Based on Swin Transformer train / pretrain recipe with modifications (related to both DeiT and ConvNeXt recipes) AdamW optimizer, gradient clipping, EMA weight averaging Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 30.4 - GMACs: 9.6 - Activations (M): 93.6 - Image size: 448 x 448 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-12k - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
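A quick, hedged way to sanity-check the parameter count quoted above (exact totals can differ slightly depending on what is counted):

```python
import timm

# Count trainable parameters; should land near the 30.4M quoted in the card.
model = timm.create_model('efficientnet_b5.sw_in12k_ft_in1k', pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')
```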
maxvit_tiny_tf_512.in1k
An official MaxViT image classification model. Trained in tensorflow on ImageNet-1k by paper authors. Ported from official Tensorflow implementation (https://github.com/google-research/maxvit) to PyTorch by Ross Wightman. MaxxViT covers a number of related model architectures that share a common structure including: - CoAtNet - Combining MBConv (depthwise-separable) convolutional blocks in early stages with self-attention transformer blocks in later stages. - MaxViT - Uniform blocks across all stages, each containing a MBConv (depthwise-separable) convolution block followed by two self-attention blocks with different partitioning schemes (window followed by grid). - CoAtNeXt - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in CoAtNet. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT - A timm specific arch that uses ConvNeXt blocks in place of MBConv blocks in MaxViT. All normalization layers are LayerNorm (no BatchNorm). - MaxxViT-V2 - A MaxxViT variation that removes the window block attention leaving only ConvNeXt blocks and grid attention w/ more width to compensate. Aside from the major variants listed above, there are more subtle changes from model to model. Any model name with the string `rw` are `timm` specific configs w/ modelling adjustments made to favour PyTorch eager use. These were created while training initial reproductions of the models so there are variations. All models with the string `tf` are models exactly matching Tensorflow based models by the original paper authors with weights ported to PyTorch. This covers a number of MaxViT models. The official CoAtNet models were never released. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 31.0 - GMACs: 33.5 - Activations (M): 257.6 - Image size: 512 x 512 - Papers: - MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/abs/2204.01697 - Dataset: ImageNet-1k Model Comparison By Top-1 |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| 
|maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| By Throughput (samples / sec) |model |top1 |top5 |samples / sec |Params (M) |GMAC |Act (M)| |------------------------------------------------------------------------------------------------------------------------|----:|----:|--------------:|--------------:|-----:|------:| |coatnextnanorw224.swin1k |81.95|95.92| 2525.52| 14.70| 2.47| 12.80| |coatnetnanorw224.swin1k |81.70|95.64| 2344.52| 15.14| 2.41| 15.41| |coatnetrmlpnanorw224.swin1k |82.05|95.87| 2109.09| 15.15| 2.62| 20.34| |coatnet0rw224.swin1k |82.39|95.84| 1831.21| 27.44| 4.43| 18.73| |coatnetbn0rw224.swin1k |82.39|96.19| 1600.14| 27.44| 4.67| 22.04| |maxvitrmlppicorw256.swin1k |80.53|95.21| 1594.71| 7.52| 1.85| 24.86| |maxxvitrmlpnanorw256.swin1k |83.03|96.34| 1341.24| 16.78| 4.37| 26.05| |maxvitrmlpnanorw256.swin1k |82.96|96.26| 1283.24| 15.50| 4.47| 31.92| |maxxvitv2nanorw256.swin1k |83.11|96.33| 1276.88| 23.70| 6.26| 23.05| |maxvitnanorw256.swin1k |82.93|96.23| 1218.17| 15.45| 4.46| 30.28| |maxvittinyrw224.swin1k |83.50|96.50| 1100.53| 29.06| 5.11| 33.11| |coatnetrmlp1rw224.swin1k |83.36|96.45| 1093.03| 41.69| 7.85| 35.47| |coatnetrmlp1rw2224.swin12kftin1k |84.90|96.96| 1025.45| 41.72| 8.11| 40.13| |maxvittinytf224.in1k |83.41|96.59| 1004.94| 30.92| 5.60| 35.78| |coatnet1rw224.swin1k |83.62|96.38| 989.59| 41.72| 8.04| 34.60| |maxvitrmlptinyrw256.swin1k |84.23|96.78| 807.21| 29.15| 6.77| 46.92| |maxvitrmlpsmallrw224.swin1k |84.49|96.76| 693.82| 64.90| 10.75| 49.30| |maxvitsmalltf224.in1k |84.43|96.83| 647.96| 68.93| 11.66| 53.17| |coatnet2rw224.swin12kftin1k |86.57|97.89| 631.88| 73.87| 15.09| 49.22| |coatnetrmlp2rw224.swin1k |84.61|96.74| 625.81| 73.88| 15.18| 54.78| |coatnetrmlp2rw224.swin12kftin1k |86.49|97.90| 620.58| 73.88| 15.18| 54.78| |maxxvitrmlpsmallrw256.swin1k |84.63|97.06| 575.53| 66.01| 14.67| 58.38| 
|maxxvitv2rmlpbaserw224.swin12kftin1k |86.64|98.02| 501.03| 116.09| 24.20| 62.77| |maxvitrmlpbaserw224.swin12kftin1k |86.89|98.02| 375.86| 116.14| 23.15| 92.64| |maxvitbasetf224.in1k |84.85|96.99| 358.25| 119.47| 24.04| 95.01| |maxvittinytf384.in1k |85.11|97.38| 293.46| 30.98| 17.53| 123.42| |maxvitlargetf224.in1k |84.93|96.97| 247.71| 211.79| 43.68| 127.35| |maxvitsmalltf384.in1k |85.54|97.46| 188.35| 69.02| 35.87| 183.65| |coatnetrmlp2rw384.swin12kftin1k |87.39|98.31| 160.80| 73.88| 47.69| 209.43| |maxxvitv2rmlpbaserw384.swin12kftin1k |87.47|98.37| 149.49| 116.09| 72.98| 213.74| |maxvittinytf512.in1k |85.67|97.58| 144.25| 31.05| 33.49| 257.59| |maxvitrmlpbaserw384.swin12kftin1k |87.81|98.37| 106.55| 116.14| 70.97| 318.95| |maxvitbasetf384.in21kftin1k |87.92|98.54| 104.71| 119.65| 73.80| 332.90| |maxvitbasetf384.in1k |86.29|97.80| 101.09| 119.65| 73.80| 332.90| |maxvitsmalltf512.in1k |86.10|97.76| 88.63| 69.13| 67.26| 383.77| |maxvitlargetf384.in21kftin1k |87.98|98.56| 71.75| 212.03|132.55| 445.84| |maxvitlargetf384.in1k |86.23|97.69| 70.56| 212.03|132.55| 445.84| |maxvitbasetf512.in21kftin1k |88.20|98.53| 50.87| 119.88|138.02| 703.99| |maxvitbasetf512.in1k |86.60|97.92| 50.75| 119.88|138.02| 703.99| |maxvitxlargetf384.in21kftin1k |88.32|98.54| 42.53| 475.32|292.78| 668.76| |maxvitlargetf512.in21kftin1k |88.04|98.40| 36.42| 212.33|244.75| 942.15| |maxvitlargetf512.in1k |86.52|97.88| 36.04| 212.33|244.75| 942.15| |maxvitxlargetf512.in21kftin1k |88.53|98.64| 21.76| 475.77|534.14|1413.22|
tf_efficientnetv2_xl.in21k
An EfficientNet-v2 image classification model. Trained on ImageNet-21k in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 234.8 - GMACs: 52.8 - Activations (M): 139.2 - Image size: train = 384 x 384, test = 512 x 512 - Papers: - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298 - Dataset: ImageNet-21k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
regnety_120.sw_in12k_ft_in1k
mobilevit_s.cvnets_in1k
A MobileViT image classification model. Trained on ImageNet-1k by paper authors. See license details at https://github.com/apple/ml-cvnets/blob/main/LICENSE Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 5.6 - GMACs: 2.0 - Activations (M): 19.9 - Image size: 256 x 256 - Papers: - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer: https://arxiv.org/abs/2110.02178 - Original: https://github.com/apple/ml-cvnets - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
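Related MobileViT variants can be enumerated with `timm.list_models`; a small sketch (the wildcard pattern is illustrative):

```python
import timm

# List pretrained MobileViT / MobileViT-v2 checkpoints known to this timm install.
print(timm.list_models('mobilevit*', pretrained=True))
```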
efficientnet_b2.ra_in1k
convnextv2_tiny.fcmae_ft_in22k_in1k
ghostnet_100.in1k
repvgg_a2.rvgg_in1k
vit_base_r50_s16_384.orig_in21k_ft_in1k
A ResNet - Vision Transformer (ViT) hybrid image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 99.0 - GMACs: 61.3 - Activations (M): 81.8 - Image size: 384 x 384 - Papers: - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
pit_b_224.in1k
A PiT (Pooling based Vision Transformer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 73.8 - GMACs: 12.4 - Activations (M): 32.9 - Image size: 224 x 224 - Papers: - Rethinking Spatial Dimensions of Vision Transformers: https://arxiv.org/abs/2103.16302 - Dataset: ImageNet-1k - Original: https://github.com/naver-ai/pit Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
coat_small.in1k
A CoaT (Co-Scale Conv-Attentional Transformer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 21.7 - GMACs: 12.6 - Activations (M): 44.3 - Image size: 224 x 224 - Papers: - Co-Scale Conv-Attentional Image Transformers: https://arxiv.org/abs/2104.06399 - Dataset: ImageNet-1k - Original: https://github.com/mlpc-ucsd/CoaT Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
dm_nfnet_f0.dm_in1k
test_resnet.r160_in1k
vgg16.tv_in1k
caformer_s36.sail_in1k
A CAFormer (a MetaFormer) image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 39.3 - GMACs: 8.0 - Activations (M): 37.5 - Image size: 224 x 224 - Papers: - Metaformer baselines for vision: https://arxiv.org/abs/2210.13452 - Original: https://github.com/sail-sg/metaformer - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
ViT-B-32-SigLIP2-256
levit_256.fb_dist_in1k
visformer_small.in1k
efficientnet_b1.ra4_e3600_r240_in1k
beit_base_patch16_224.in22k_ft_in22k_in1k
A BEiT image classification model. Trained on ImageNet-22k with self-supervised masked image modelling (MIM) using a DALL-E dVAE as visual tokenizer. Fine-tuned on ImageNet-22k and then ImageNet-1k. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 86.5 - GMACs: 17.6 - Activations (M): 23.9 - Image size: 224 x 224 - Papers: - BEiT: BERT Pre-Training of Image Transformers: https://arxiv.org/abs/2106.08254 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-22k - Original: https://github.com/microsoft/unilm/tree/master/beit Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
PE-Core-bigG-14-448
nfnet_l0.ra2_in1k
convnextv2_base.fcmae_ft_in22k_in1k
A ConvNeXt-V2 image classification model. Pretrained with a fully convolutional masked autoencoder framework (FCMAE) and fine-tuned on ImageNet-22k and then ImageNet-1k. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 88.7 - GMACs: 15.4 - Activations (M): 28.8 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders: https://arxiv.org/abs/2301.00808 - Original: https://github.com/facebookresearch/ConvNeXt-V2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results. All timing numbers from eager model PyTorch 1.13 on RTX 3090 w/ AMP. | model |top1 |top5 |imgsize|paramcount|gmacs |macts |samplespersec|batchsize| |------------------------------------------------------------------------------------------------------------------------------|------|------|--------|-----------|------|------|---------------|----------| | convnextv2huge.fcmaeftin22kin1k512 |88.848|98.742|512 |660.29 |600.81|413.07|28.58 |48 | | convnextv2huge.fcmaeftin22kin1k384 |88.668|98.738|384 |660.29 |337.96|232.35|50.56 |64 | | convnextxxlarge.cliplaion2bsoupftin1k |88.612|98.704|256 |846.47 |198.09|124.45|122.45 |256 | | convnextlargemlp.cliplaion2bsoupftin12kin1k384 |88.312|98.578|384 |200.13 |101.11|126.74|196.84 |256 | | convnextv2large.fcmaeftin22kin1k384 |88.196|98.532|384 |197.96 |101.1 |126.74|128.94 |128 | | convnextlargemlp.cliplaion2bsoupftin12kin1k320 |87.968|98.47 |320 |200.13 |70.21 |88.02 |283.42 |256 | | convnextxlarge.fbin22kftin1k384 |87.75 |98.556|384 |350.2 |179.2 |168.99|124.85 |192 | | convnextv2base.fcmaeftin22kin1k384 |87.646|98.422|384 |88.72 |45.21 |84.49 |209.51 |256 | | convnextlarge.fbin22kftin1k384 |87.476|98.382|384 |197.77 |101.1 |126.74|194.66 |256 | | convnextlargemlp.cliplaion2baugregftin1k |87.344|98.218|256 |200.13 |44.94 |56.33 |438.08 |256 | | convnextv2large.fcmaeftin22kin1k |87.26 |98.248|224 |197.96 |34.4 |43.13 |376.84 |256 | | convnextbase.cliplaion2baugregftin12kin1k384 |87.138|98.212|384 |88.59 |45.21 |84.49 |365.47 |256 | | convnextxlarge.fbin22kftin1k |87.002|98.208|224 |350.2 |60.98 |57.5 |368.01 |256 | | convnextbase.fbin22kftin1k384 |86.796|98.264|384 |88.59 |45.21 |84.49 |366.54 |256 | | convnextv2base.fcmaeftin22kin1k |86.74 |98.022|224 |88.72 |15.38 |28.75 |624.23 |256 | | convnextlarge.fbin22kftin1k |86.636|98.028|224 |197.77 |34.4 |43.13 |581.43 |256 | | convnextbase.cliplaionaaugregftin1k384 |86.504|97.97 |384 |88.59 |45.21 |84.49 |368.14 |256 | | convnextbase.cliplaion2baugregftin12kin1k |86.344|97.97 |256 |88.59 |20.09 |37.55 |816.14 |256 | | convnextv2huge.fcmaeftin1k |86.256|97.75 |224 |660.29 |115.0 |79.07 |154.72 |256 | | convnextsmall.in12kftin1k384 |86.182|97.92 |384 |50.22 |25.58 |63.37 |516.19 |256 | | convnextbase.cliplaion2baugregftin1k |86.154|97.68 |256 |88.59 |20.09 |37.55 |819.86 |256 | | convnextbase.fbin22kftin1k |85.822|97.866|224 |88.59 |15.38 |28.75 |1037.66 |256 | | convnextsmall.fbin22kftin1k384 |85.778|97.886|384 |50.22 |25.58 |63.37 |518.95 |256 | | convnextv2large.fcmaeftin1k |85.742|97.584|224 |197.96 |34.4 |43.13 |375.23 |256 | | convnextsmall.in12kftin1k |85.174|97.506|224 |50.22 |8.71 |21.56 |1474.31 |256 | | convnexttiny.in12kftin1k384 |85.118|97.608|384 |28.59 |13.14 |39.48 |856.76 |256 | | convnextv2tiny.fcmaeftin22kin1k384 |85.112|97.63 |384 |28.64 |13.14 |39.48 |491.32 |256 | | convnextv2base.fcmaeftin1k |84.874|97.09 |224 
|88.72 |15.38 |28.75 |625.33 |256 | | convnextsmall.fbin22kftin1k |84.562|97.394|224 |50.22 |8.71 |21.56 |1478.29 |256 | | convnextlarge.fbin1k |84.282|96.892|224 |197.77 |34.4 |43.13 |584.28 |256 | | convnexttiny.in12kftin1k |84.186|97.124|224 |28.59 |4.47 |13.44 |2433.7 |256 | | convnexttiny.fbin22kftin1k384 |84.084|97.14 |384 |28.59 |13.14 |39.48 |862.95 |256 | | convnextv2tiny.fcmaeftin22kin1k |83.894|96.964|224 |28.64 |4.47 |13.44 |1452.72 |256 | | convnextbase.fbin1k |83.82 |96.746|224 |88.59 |15.38 |28.75 |1054.0 |256 | | convnextv2nano.fcmaeftin22kin1k384 |83.37 |96.742|384 |15.62 |7.22 |24.61 |801.72 |256 | | convnextsmall.fbin1k |83.142|96.434|224 |50.22 |8.71 |21.56 |1464.0 |256 | | convnextv2tiny.fcmaeftin1k |82.92 |96.284|224 |28.64 |4.47 |13.44 |1425.62 |256 | | convnexttiny.fbin22kftin1k |82.898|96.616|224 |28.59 |4.47 |13.44 |2480.88 |256 | | convnextnano.in12kftin1k |82.282|96.344|224 |15.59 |2.46 |8.37 |3926.52 |256 | | convnexttinyhnf.a2hin1k |82.216|95.852|224 |28.59 |4.47 |13.44 |2529.75 |256 | | convnexttiny.fbin1k |82.066|95.854|224 |28.59 |4.47 |13.44 |2346.26 |256 | | convnextv2nano.fcmaeftin22kin1k |82.03 |96.166|224 |15.62 |2.46 |8.37 |2300.18 |256 | | convnextv2nano.fcmaeftin1k |81.83 |95.738|224 |15.62 |2.46 |8.37 |2321.48 |256 | | convnextnanools.d1hin1k |80.866|95.246|224 |15.65 |2.65 |9.38 |3523.85 |256 | | convnextnano.d1hin1k |80.768|95.334|224 |15.59 |2.46 |8.37 |3915.58 |256 | | convnextv2pico.fcmaeftin1k |80.304|95.072|224 |9.07 |1.37 |6.1 |3274.57 |256 | | convnextpico.d1in1k |79.526|94.558|224 |9.05 |1.37 |6.1 |5686.88 |256 | | convnextpicools.d1in1k |79.522|94.692|224 |9.06 |1.43 |6.5 |5422.46 |256 | | convnextv2femto.fcmaeftin1k |78.488|93.98 |224 |5.23 |0.79 |4.57 |4264.2 |256 | | convnextfemtools.d1in1k |77.86 |93.83 |224 |5.23 |0.82 |4.87 |6910.6 |256 | | convnextfemto.d1in1k |77.454|93.68 |224 |5.22 |0.79 |4.57 |7189.92 |256 | | convnextv2atto.fcmaeftin1k |76.664|93.044|224 |3.71 |0.55 |3.81 |4728.91 |256 | | convnextattools.a2in1k |75.88 |92.846|224 |3.7 |0.58 |4.11 |7963.16 |256 | | convnextatto.d2in1k |75.664|92.9 |224 |3.7 |0.55 |3.81 |8439.22 |256 |
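The samples-per-second column above was measured externally; a rough, hedged sketch of reproducing such a number locally (batch size, iteration counts, and AMP settings here are illustrative and will not match the table exactly):

```python
import time
import timm
import torch

# Rough throughput sketch (illustrative settings; requires a CUDA device).
device = torch.device('cuda')
model = timm.create_model('convnextv2_base.fcmae_ft_in22k_in1k', pretrained=False).eval().to(device)

batch = torch.randn(64, 3, 224, 224, device=device)
with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
    for _ in range(5):  # warmup iterations
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 20
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()

print(f'{iters * batch.shape[0] / (time.time() - start):.1f} samples/sec')
```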
tf_efficientnetv2_b0.in1k
efficientnetv2_rw_m.agc_in1k
maxvit_large_tf_224.in21k
vit_base_patch16_384.augreg_in21k_ft_in1k
deit_base_patch16_224.fb_in1k
A DeiT image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 86.6 - GMACs: 17.6 - Activations (M): 23.9 - Image size: 224 x 224 - Papers: - Training data-efficient image transformers & distillation through attention: https://arxiv.org/abs/2012.12877 - Original: https://github.com/facebookresearch/deit - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
vit_large_patch14_clip_224.laion400m_e32
lcnet_050.ra2_in1k
convnext_pico.d1_in1k
vit_small_patch16_384.augreg_in21k_ft_in1k
A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 22.2 - GMACs: 12.4 - Activations (M): 24.2 - Image size: 384 x 384 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
ViT-B-16-SigLIP
cait_m48_448.fb_dist_in1k
A CaiT (Class-Attention in Image Transformers) image classification model. Pretrained on ImageNet-1k with distillation by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 356.5 - GMACs: 329.4 - Activations (M): 1708.2 - Image size: 448 x 448 - Papers: - Going deeper with Image Transformers: https://arxiv.org/abs/2103.17239 - Dataset: ImageNet-1k - Original: https://github.com/facebookresearch/deit
vit_small_patch16_224.augreg_in21k
eva02_enormous_patch14_plus_clip_224.laion2b_s9b_b144k
wide_resnet101_2.tv_in1k
tf_efficientnet_b3.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 12.2 - GMACs: 1.9 - Activations (M): 23.8 - Image size: 300 x 300 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
vit_base_patch16_siglip_224.v2_webli
convnext_atto.d2_in1k
resnetv2_50x1_bit.goog_in21k_ft_in1k
A ResNet-V2-BiT (Big Transfer w/ pre-activation ResNet) image classification model. Pretrained on ImageNet-21k and fine-tuned on ImageNet-1k by paper authors. This model uses Group Normalization (GN) in combination with Weight Standardization (WS) instead of Batch Normalization (BN). Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 25.5 - GMACs: 16.6 - Activations (M): 44.5 - Image size: 448 x 448 - Papers: - Big Transfer (BiT): General Visual Representation Learning: https://arxiv.org/abs/1912.11370 - Identity Mappings in Deep Residual Networks: https://arxiv.org/abs/1603.05027 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/bigtransfer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
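BiT checkpoints are intended for transfer learning; a hedged fine-tuning sketch (the 10-class head, optimizer settings, and random batch are illustrative, not from the card):

```python
import timm
import torch

# Replace the 1k-class head with a fresh 10-class head for transfer learning.
model = timm.create_model(
    'resnetv2_50x1_bit.goog_in21k_ft_in1k',
    pretrained=True,
    num_classes=10,  # illustrative target-task class count
)

optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on random data (stand-in for a real loader).
images = torch.randn(4, 3, 448, 448)   # 448 x 448 matches the card's image size
labels = torch.randint(0, 10, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```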
vit_base_patch16_siglip_256.webli
vit_base_patch16_224.mae
vit_tiny_patch16_384.augreg_in21k_ft_in1k
convnext_base.clip_laion2b_augreg_ft_in12k_in1k_384
vit_large_patch16_siglip_256.v2_webli
ViT-L-16-SigLIP2-512
ViT-B-16-SigLIP2
tf_efficientnet_b5.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 30.4 - GMACs: 10.5 - Activations (M): 98.9 - Image size: 456 x 456 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
fastvit_t8.apple_dist_in1k
A FastViT image classification model. Trained on ImageNet-1k with distillation by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 4.0 - GMACs: 0.7 - Activations (M): 8.6 - Image size: 256 x 256 - Papers: - FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization: https://arxiv.org/abs/2303.14189 - Original: https://github.com/apple/ml-fastvit - Dataset: ImageNet-1k
mobilenetv4_conv_small.e2400_r224_in1k
ViT-SO400M-14-SigLIP2-378
vit_base_patch16_siglip_512.v2_webli
A SigLIP 2 ViT (image encoder only) for `timm`. Equivalent to the image tower from https://huggingface.co/timm/ViT-B-16-SigLIP2-512. Model Details - Dataset: webli - Papers: - SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features: https://arxiv.org/abs/2502.14786 - Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343
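A hedged image-embedding sketch for this encoder-only tower (standard `timm` usage; the L2 normalization is a common retrieval convention, not something the card specifies):

```python
import timm
import torch
import torch.nn.functional as F

# The SigLIP 2 tower ships without a classifier; use it for image embeddings.
model = timm.create_model(
    'vit_base_patch16_siglip_512.v2_webli',
    pretrained=True,
    num_classes=0,
).eval()

x = torch.randn(2, 3, 512, 512)  # 512 x 512 input for this variant
with torch.no_grad():
    emb = F.normalize(model(x), dim=-1)  # (2, embed_dim), unit-norm for retrieval
print(emb.shape)
```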
convnext_small.in12k_ft_in1k
convnextv2_tiny.fcmae_ft_in22k_in1k_384
resnext50_32x4d.a1h_in1k
vit_base_patch32_224.augreg_in21k_ft_in1k
A Vision Transformer (ViT) image classification model. Trained on ImageNet-21k and fine-tuned on ImageNet-1k (with additional augmentation and regularization) in JAX by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 88.2 - GMACs: 4.4 - Activations (M): 4.2 - Image size: 224 x 224 - Papers: - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-21k - Original: https://github.com/google-research/visiontransformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
hrnet_w18.ms_aug_in1k
resnet50.am_in1k
vit_base_patch14_reg4_dinov2.lvd142m
efficientformerv2_s0.snap_dist_in1k
tf_efficientnet_b4.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 19.3 - GMACs: 4.5 - Activations (M): 49.5 - Image size: 380 x 380 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
resnet101_clip.yfcc15m
swin_small_patch4_window7_224.ms_in22k_ft_in1k
repvit_m1.dist_in1k
resnet152.a1_in1k
maxvit_tiny_rw_224.sw_in1k
resnet50.tv_in1k
resnet18.tv_in1k
convnext_base.dinov3_lvd1689m
A DINOv3 ConvNeXt image feature model. Pretrained on LVD-1689M with the self-supervised DINOv3 method, distilled from DINOv3 ViT-7B. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 87.6 - GMACs: 15.4 - Activations (M): 28.8 - Image size: 224 x 224 - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models - Original: https://github.com/facebookresearch/dinov3 - Pretrain Dataset: LVD-1689M - License: DINOv3
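A hedged dense-feature sketch for this backbone using `forward_features`, `timm`'s standard call for the pre-pooling feature map:

```python
import timm
import torch

# Dense features from the DINOv3-distilled ConvNeXt backbone.
model = timm.create_model('convnext_base.dinov3_lvd1689m', pretrained=True).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    fmap = model.forward_features(x)  # (B, C, H, W) final-stage feature map
print(fmap.shape)
```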
tiny_vit_21m_512.dist_in22k_ft_in1k
A TinyViT image classification model. Pretrained on ImageNet-22k with distillation and fine-tuned on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 21.3 - GMACs: 21.2 - Activations (M): 83.3 - Image size: 512 x 512 - Papers: - TinyViT: Fast Pretraining Distillation for Small Vision Transformers: https://arxiv.org/abs/2207.10666 - Original: https://github.com/microsoft/Cream/tree/main/TinyViT - Dataset: ImageNet-1k - Pretrain Dataset: ImageNet-22k
tf_efficientnet_lite0.in1k
beitv2_base_patch16_224.in1k_ft_in22k_in1k
tf_efficientnet_b2.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in Tensorflow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 9.1 - GMACs: 1.0 - Activations (M): 13.8 - Image size: 260 x 260 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
densenet201.tv_in1k
vit_so400m_patch14_siglip_224.v2_webli
vit_large_patch14_clip_224.laion400m_e31
vit_small_r26_s32_224.augreg_in21k
resnext101_32x8d.fb_wsl_ig1b_ft_in1k
vit_small_patch8_224.dino
resnest14d.gluon_in1k
vit_base_patch16_siglip_256.v2_webli
ViT-L-16-SigLIP2-256
convnext_tiny.fb_in22k_ft_in1k
vit_large_patch16_siglip_384.v2_webli
regnetx_002.pycls_in1k
vit_7b_patch16_dinov3.lvd1689m
A DINOv3 ViT model image feature encoder. Pretrained on LVD-1689M with self-supervised DINOv3 method. Model Notes The original model weights ended up with all QKV projection biases being zeroes. For `timm`, have disabled the QKV bias (`qkvbias=False`) for the models and not loaded the zero weights. For some model sizes there are variants with `qkvb` in the name that have the bias enabled (`qkvbias=True`), but zero, to match the behaviour of `transformers` and original models. The original models keep RoPE periods as a persistent `bfloat16` buffer. `timm` generates `float32` periods at init. This results in some numerical differences, however the `timm` approach should be less problematic running on devices without bfloat16 support, and appears to work as well if not slightly better for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and result in matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 6716.0 - GMACs: 1775.1 - Activations (M): 515.9 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols Results for ViT backbones pretrained (or distilled) on web (LVD-1689M) | Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair | |-------|---------|------|---------|-------|--------|------|-------|------|-------| | Global Tasks | | | | | Dense Tasks | | | | | | DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 | | DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 | | DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 | | DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 | | DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 | | DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 | Results for ConvNeXt backbones distilled on web (LVD-1689M) | Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ | |-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------| | Global Tasks | | | | | | | Dense Tasks | | | DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 | | DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 | | DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 | | DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 | Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M) | Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean | |-------|---------|--------------|-----------|-------------|----------|----------|------| | DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 | | DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 | | Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean | 
|-------|----------|--------------|------------|-------------|--------------|-----------|------| | DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 | | DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
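The RoPE note above gives a one-liner for matching the original release; in context it might look like the sketch below (note this checkpoint is ~6.7B parameters, so the same pattern is usually tried first on a smaller DINOv3 variant):

```python
import timm
import torch

# Create the encoder, then round-trip the RoPE periods through bfloat16 as the
# card describes, so outputs match the original DINOv3 weights more closely.
model = timm.create_model('vit_7b_patch16_dinov3.lvd1689m', pretrained=True).eval()
model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)

x = torch.randn(1, 3, 256, 256)  # 256 x 256 per the card's image size
with torch.no_grad():
    features = model.forward_features(x)
```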
eva02_large_patch14_448.mim_m38m_ft_in22k
eca_nfnet_l0.ra2_in1k
vit_base_patch8_224.dino
eva02_base_patch16_clip_224.merged2b_s8b_b131k
convnextv2_large.fcmae_ft_in22k_in1k
eva_large_patch14_196.in22k_ft_in22k_in1k
fastvit_t8.apple_in1k
efficientnetv2_rw_s.ra2_in1k
swinv2_tiny_window8_256.ms_in1k
vit_large_patch14_clip_224.laion2b
convnext_tiny.fb_in22k
vit_base_patch16_224_miil.in21k
vit_small_patch16_dinov3_qkvb.lvd1689m
A DINOv3 ViT image feature encoder. Distilled on LVD-1689M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeros. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`) but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some small numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support and appears to work as well, if not slightly better, for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and produce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 21.6 - GMACs: 6.3 - Activations (M): 17.0 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols.
Results for ViT backbones pretrained (or distilled) on web (LVD-1689M)
| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|-------|---------|------|---------|-------|--------|------|-------|------|-------|
| Global Tasks | | | | | Dense Tasks | | | | |
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
| DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 |
| DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
| DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 |
Results for ConvNeXt backbones distilled on web (LVD-1689M)
| Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ |
|-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------|
| Global Tasks | | | | | | | Dense Tasks | |
| DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 |
| DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 |
| DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 |
| DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 |
Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M)
| Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean |
|-------|---------|--------------|-----------|-------------|----------|----------|------|
| DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 |
| DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 |
| Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------|
| DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 |
| DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
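To make the notes above concrete, here is a minimal usage sketch, assuming a recent `timm` release that includes the DINOv3 weights; the random tensor stands in for a preprocessed image batch.

```python
import torch
import timm
from timm.data import resolve_model_data_config, create_transform

# DINOv3 ViT-S/16 (qkvb variant) image feature encoder, pretrained on LVD-1689M.
model = timm.create_model('vit_small_patch16_dinov3_qkvb.lvd1689m', pretrained=True).eval()

# Optional: truncate the float32 RoPE periods to bfloat16 and back so that
# outputs match the original DINOv3 release bit-for-bit (see Model Notes).
model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)

# Eval-time preprocessing (256 x 256 input per the pretrained config).
data_config = resolve_model_data_config(model)
transform = create_transform(**data_config, is_training=False)

x = torch.randn(1, *data_config['input_size'])  # stand-in for transform(img).unsqueeze(0)
with torch.no_grad():
    tokens = model.forward_features(x)   # (1, num_tokens, 384) token embeddings
    pooled = model.forward_head(tokens)  # pooled image embedding
print(tokens.shape, pooled.shape)
```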
resnet50_clip.yfcc15m
tf_efficientnetv2_l.in21k_ft_in1k
fastvit_t12.apple_in1k
resnet152.a3_in1k
resnet50_gn.a1h_in1k
seresnet50.a1_in1k
mnasnet_100.rmsp_in1k
vit_base_patch32_clip_224.laion2b_e16
convnext_base.fb_in22k_ft_in1k_384
convformer_s18.sail_in22k
dm_nfnet_f1.dm_in1k
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_320
tf_efficientnet_b1.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 7.8 - GMACs: 0.7 - Activations (M): 10.9 - Image size: 240 x 240 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
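For readers who want to try the checkpoint, the following is a minimal classification sketch in the style of other `timm` model cards; the image path is a placeholder, not part of the original card.

```python
import torch
import timm
from PIL import Image
from timm.data import resolve_model_data_config, create_transform

# Load the pretrained Noisy Student EfficientNet-B1 (1000 ImageNet-1k classes).
model = timm.create_model('tf_efficientnet_b1.ns_jft_in1k', pretrained=True).eval()

# Build eval-time preprocessing from the model's pretrained config (240 x 240 input).
data_config = resolve_model_data_config(model)
transform = create_transform(**data_config, is_training=False)

# 'your_image.jpg' is a placeholder; substitute any RGB image on disk.
img = Image.open('your_image.jpg').convert('RGB')

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)

top5_prob, top5_idx = torch.topk(logits.softmax(dim=1), k=5)
print(top5_idx[0].tolist(), top5_prob[0].tolist())
```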
tiny_vit_21m_224.dist_in22k_ft_in1k
mobilevitv2_050.cvnets_in1k
eca_nfnet_l2.ra3_in1k
convnext_small.fb_in22k_ft_in1k
mobilenetv4_conv_medium.e500_r256_in1k
vit_huge_plus_patch16_dinov3.lvd1689m
A DINOv3 ViT image feature encoder. Distilled on LVD-1689M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeros. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`) but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some small numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support and appears to work as well, if not slightly better, for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and produce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 840.5 - GMACs: 224.9 - Activations (M): 193.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: LVD-1689M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols.
Results for ViT backbones pretrained (or distilled) on web (LVD-1689M)
| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|-------|---------|------|---------|-------|--------|------|-------|------|-------|
| Global Tasks | | | | | Dense Tasks | | | | |
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
| DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 |
| DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
| DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 |
Results for ConvNeXt backbones distilled on web (LVD-1689M)
| Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ |
|-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------|
| Global Tasks | | | | | | | Dense Tasks | |
| DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 |
| DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 |
| DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 |
| DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 |
Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M)
| Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean |
|-------|---------|--------------|-----------|-------------|----------|----------|------|
| DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 |
| DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 |
| Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------|
| DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 |
| DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
maxvit_large_tf_512.in1k
tf_efficientnet_b7.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 66.3 - GMACs: 38.3 - Activations (M): 289.9 - Image size: 600 x 600 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
tf_efficientnet_lite1.in1k
resmlp_12_224.fb_in1k
coatnet_1_rw_224.sw_in1k
mobilenetv4_conv_small_050.e3000_r224_in1k
resnetv2_50x1_bit.goog_in21k
hrnet_w32.ms_in1k
mobileone_s1.apple_in1k
vit_base_patch16_224.augreg_in21k_ft_in1k
efficientvit_m5.r224_in1k
vit_base_patch16_224.orig_in21k
convnextv2_atto.fcmae_ft_in1k
resnet34.tv_in1k
tinynet_a.in1k
ViT-B-16-SigLIP-512
vit_base_patch16_clip_224.metaclip_2pt5b
wide_resnet101_2.tv2_in1k
vit_base_patch16_clip_224.openai_ft_in1k
pvt_v2_b2.in1k
vit_small_patch16_224.augreg_in1k
efficientnet_lite0.ra_in1k
convnextv2_base.fcmae_ft_in22k_in1k_384
vit_small_r26_s32_224.augreg_in21k_ft_in1k
deit_tiny_distilled_patch16_224.fb_in1k
coatnet_0_rw_224.sw_in1k
vgg11.tv_in1k
ese_vovnet19b_dw.ra_in1k
dm_nfnet_f3.dm_in1k
ViT-gopt-16-SigLIP2-384
regnety_002.pycls_in1k
ViT-B-16-SigLIP-384
vit_so400m_patch16_siglip_512.v2_webli
resnet152.a1h_in1k
resnet10t.c3_in1k
vit_huge_patch14_clip_224.metaclip_2pt5b
wide_resnet50_2.tv2_in1k
vit_large_patch14_clip_224.openai_ft_in1k
efficientvit_b2.r224_in1k
mixer_b16_224.goog_in21k_ft_in1k
maxvit_base_tf_384.in1k
convnext_base.fb_in22k
eva02_base_patch16_clip_224.merged2b
swinv2_tiny_window16_256.ms_in1k
vit_base_patch16_224.orig_in21k_ft_in1k
densenet121.tv_in1k
resnest50d.in1k
PE-Core-B-16
vit_small_patch32_224.augreg_in21k_ft_in1k
swinv2_base_window12to16_192to256.ms_in22k_ft_in1k
vit_large_patch14_clip_224.openai
tf_efficientnetv2_b3.in21k_ft_in1k
swinv2_base_window8_256.ms_in1k
vit_base_patch32_clip_224.metaclip_400m
wide_resnet50_2.tv_in1k
swin_large_patch4_window12_384.ms_in22k_ft_in1k
efficientnetv2_rw_t.ra2_in1k
deit_small_distilled_patch16_224.fb_in1k
fbnetc_100.rmsp_in1k
ViT-SO400M-16-SigLIP2-512
resnet101.a1_in1k
resnet50.tv2_in1k
levit_128.fb_dist_in1k
ViT-L-16-SigLIP-256
cspdarknet53.ra_in1k
efficientformerv2_l.snap_dist_in1k
resnet50_clip.cc12m
tinynet_e.in1k
cait_s24_224.fb_dist_in1k
seresnext50_32x4d.racm_in1k
regnetx_006.pycls_in1k
vit_gigantic_patch14_clip_224.metaclip_2pt5b
regnety_160.deit_in1k
swin_large_patch4_window7_224.ms_in22k
regnety_008.pycls_in1k
mobilenetv4_conv_small.e1200_r224_in1k
mobilevitv2_200.cvnets_in1k
tresnet_m.miil_in21k
rexnet_100.nav_in1k
crossvit_9_240.in1k
maxvit_base_tf_512.in21k_ft_in1k
mixnet_l.ft_in1k
swinv2_base_window12to24_192to384.ms_in22k_ft_in1k
eva02_base_patch14_224.mim_in22k
deit_base_patch16_384.fb_in1k
swin_large_patch4_window7_224.ms_in22k_ft_in1k
caformer_b36.sail_in22k_ft_in1k
convnext_small.fb_in22k_ft_in1k_384
maxvit_tiny_tf_224.in1k
ViT-L-16-SigLIP-384
swinv2_cr_tiny_ns_224.sw_in1k
vit_large_patch16_dinov3.sat493m
A DINOv3 ViT image feature encoder. Distilled on SAT-493M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeros. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`) but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some small numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support and appears to work as well, if not slightly better, for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and produce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 303.1 - GMACs: 82.4 - Activations (M): 90.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: SAT-493M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols.
Results for ViT backbones pretrained (or distilled) on web (LVD-1689M)
| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|-------|---------|------|---------|-------|--------|------|-------|------|-------|
| Global Tasks | | | | | Dense Tasks | | | | |
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
| DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 |
| DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
| DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 |
Results for ConvNeXt backbones distilled on web (LVD-1689M)
| Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ |
|-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------|
| Global Tasks | | | | | | | Dense Tasks | |
| DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 |
| DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 |
| DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 |
| DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 |
Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M)
| Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean |
|-------|---------|--------------|-----------|-------------|----------|----------|------|
| DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 |
| DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 |
| Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------|
| DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 |
| DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
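Beyond pooled embeddings, dense features are usually the point of a DINOv3 backbone. The sketch below assumes a recent `timm` (1.x) where ViT models expose `forward_intermediates`; the block indices and printed shapes are illustrative, not taken from the original card.

```python
import torch
import timm

# DINOv3 ViT-L/16 distilled on SAT-493M, used as a dense feature extractor.
model = timm.create_model('vit_large_patch16_dinov3.sat493m', pretrained=True).eval()

x = torch.randn(1, 3, 256, 256)  # stand-in for a preprocessed satellite image batch
with torch.no_grad():
    # Returns the final token output plus NCHW feature maps from selected blocks.
    final_tokens, intermediates = model.forward_intermediates(x, indices=[5, 11, 17, 23])

for fmap in intermediates:
    print(tuple(fmap.shape))  # expected (1, 1024, 16, 16) for a 256 x 256 input
```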
tiny_vit_21m_384.dist_in22k_ft_in1k
davit_tiny.msft_in1k
vit_large_patch32_384.orig_in21k_ft_in1k
resnet18d.ra2_in1k
resnetv2_50.a1h_in1k
swin_base_patch4_window7_224.ms_in22k
vit_base_patch32_clip_224.metaclip_2pt5b
PE-Core-L-14-336
dla34.in1k
eva02_base_patch14_448.mim_in22k_ft_in22k_in1k
Model card for eva02_base_patch14_448.mim_in22k_ft_in22k_in1k An EVA02 image classification model. Pretrained on ImageNet-22k with masked image modeling (using EVA-CLIP as a MIM teacher) and fine-tuned on ImageNet-22k and then ImageNet-1k by paper authors. EVA-02 models are vision transformers with mean pooling, SwiGLU, Rotary Position Embeddings (ROPE), and an extra LN in the MLP (for Base & Large). NOTE: `timm` checkpoints are float32 for consistency with other models. Original checkpoints are float16 or bfloat16 in some cases; see the originals if that's preferred. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 87.1 - GMACs: 107.1 - Activations (M): 259.1 - Image size: 448 x 448 - Papers: - EVA-02: A Visual Representation for Neon Genesis: https://arxiv.org/abs/2303.11331 - EVA-CLIP: Improved Training Techniques for CLIP at Scale: https://arxiv.org/abs/2303.15389 - Original: - https://github.com/baaivision/EVA - https://huggingface.co/Yuxin-CV/EVA-02 - Pretrain Dataset: ImageNet-22k - Dataset: ImageNet-1k Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
|model |top1 |top5 |param_count|img_size|
|-----------------------------------------------|------|------|-----------|--------|
|eva02_large_patch14_448.mim_m38m_ft_in22k_in1k |90.054|99.042|305.08 |448 |
|eva02_large_patch14_448.mim_in22k_ft_in22k_in1k|89.946|99.01 |305.08 |448 |
|eva_giant_patch14_560.m30m_ft_in22k_in1k |89.792|98.992|1014.45 |560 |
|eva02_large_patch14_448.mim_in22k_ft_in1k |89.626|98.954|305.08 |448 |
|eva02_large_patch14_448.mim_m38m_ft_in1k |89.57 |98.918|305.08 |448 |
|eva_giant_patch14_336.m30m_ft_in22k_in1k |89.56 |98.956|1013.01 |336 |
|eva_giant_patch14_336.clip_ft_in1k |89.466|98.82 |1013.01 |336 |
|eva_large_patch14_336.in22k_ft_in22k_in1k |89.214|98.854|304.53 |336 |
|eva_giant_patch14_224.clip_ft_in1k |88.882|98.678|1012.56 |224 |
|eva02_base_patch14_448.mim_in22k_ft_in22k_in1k |88.692|98.722|87.12 |448 |
|eva_large_patch14_336.in22k_ft_in1k |88.652|98.722|304.53 |336 |
|eva_large_patch14_196.in22k_ft_in22k_in1k |88.592|98.656|304.14 |196 |
|eva02_base_patch14_448.mim_in22k_ft_in1k |88.23 |98.564|87.12 |448 |
|eva_large_patch14_196.in22k_ft_in1k |87.934|98.504|304.14 |196 |
|eva02_small_patch14_336.mim_in22k_ft_in1k |85.74 |97.614|22.13 |336 |
|eva02_tiny_patch14_336.mim_in22k_ft_in1k |80.658|95.524|5.76 |336 |
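As a quick illustration of using this checkpoint as an image embedder rather than a classifier, the following sketch shows standard `timm` usage (not part of the original card), removing the classification head with `num_classes=0`.

```python
import torch
import timm
from timm.data import resolve_model_data_config, create_transform

# EVA02-Base at 448x448; num_classes=0 strips the classifier so the model
# returns pooled image embeddings instead of ImageNet-1k logits.
model = timm.create_model(
    'eva02_base_patch14_448.mim_in22k_ft_in22k_in1k',
    pretrained=True,
    num_classes=0,
).eval()

data_config = resolve_model_data_config(model)
transform = create_transform(**data_config, is_training=False)

x = torch.randn(1, *data_config['input_size'])  # stand-in for a preprocessed image batch
with torch.no_grad():
    embedding = model(x)  # (1, 768) pooled embedding for EVA02-Base
print(embedding.shape)
```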
vit_relpos_medium_patch16_224.sw_in1k
vit_base_patch32_clip_224.laion400m_e31
vgg16_bn.tv_in1k
regnetz_040_h.ra3_in1k
convnext_large.dinov3_lvd1689m
A DINOv3 ConvNeXt image feature model. Pretrained on LVD-1689M with the self-supervised DINOv3 method, distilled from the DINOv3 ViT-7B model. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 196.2 - GMACs: 34.4 - Activations (M): 43.1 - Image size: 224 x 224 - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - A ConvNet for the 2020s: https://arxiv.org/abs/2201.03545 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models - Original: https://github.com/facebookresearch/dinov3 - Pretrain Dataset: LVD-1689M - License: DINOv3
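Since this checkpoint is intended as a feature backbone, a minimal multi-scale feature extraction sketch may help; this is standard `timm` `features_only` usage, assuming the weights are available via the Hub, not an excerpt from the original card.

```python
import torch
import timm

# DINOv3 ConvNeXt-Large as a multi-scale feature backbone; features_only=True
# returns one NCHW feature map per backbone stage instead of a pooled output.
backbone = timm.create_model(
    'convnext_large.dinov3_lvd1689m',
    pretrained=True,
    features_only=True,
).eval()

print(backbone.feature_info.channels())   # channels of each returned stage
print(backbone.feature_info.reduction())  # stride of each stage w.r.t. the input

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    feature_maps = backbone(x)
for fmap in feature_maps:
    print(tuple(fmap.shape))
```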
res2next50.in1k
maxvit_small_tf_224.in1k
vit_base_patch16_siglip_gap_224.v2_webli
mobilevit_xxs.cvnets_in1k
tf_efficientnetv2_s.in1k
ghostnetv2_100.in1k
convit_base.fb_in1k
mobilenetv2_050.lamb_in1k
maxvit_base_tf_224.in1k
convnextv2_large.fcmae_ft_in22k_in1k_384
mobilevitv2_075.cvnets_in1k
gmlp_s16_224.ra3_in1k
resnet50d.a1_in1k
eva02_large_patch14_clip_336.merged2b_s6b_b61k
hrnet_w48.ms_in1k
eva02_enormous_patch14_clip_224.laion2b_s4b_b115k
swinv2_small_window8_256.ms_in1k
vit_giant_patch14_dinov2.lvd142m
tf_efficientnetv2_m.in21k
convnext_tiny.fb_in1k
dm_nfnet_f2.dm_in1k
swinv2_small_window16_256.ms_in1k
resnet18.a3_in1k
mobilevitv2_100.cvnets_in1k
efficientnet_b1.ft_in1k
vit_base_patch16_clip_224.laion2b_ft_in12k_in1k
mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k
resnetv2_101x1_bit.goog_in21k_ft_in1k
darknet53.c2ns_in1k
ViT-L-16-SigLIP2-384
repvit_m1_5.dist_450e_in1k
deit3_small_patch16_224.fb_in22k_ft_in1k
convnext_small.fb_in22k
resnest101e.in1k
mobilenetv4_conv_large.e600_r384_in1k
regnety_040.ra3_in1k
xception41.tf_in1k
resnext101_32x8d.tv_in1k
eva_giant_patch14_plus_clip_224.merged2b_s11b_b114k
vit_large_patch32_224.orig_in21k
mobilenetv3_large_100.miil_in21k
tresnet_l.miil_in1k
hrnet_w18_small_v2.gluon_in1k
eva_large_patch14_336.in22k_ft_in22k_in1k
vit_huge_patch14_224.orig_in21k
convnextv2_pico.fcmae_ft_in1k
resnet101_clip.openai
vit_base_patch32_clip_224.openai
resnet34d.ra2_in1k
davit_small.msft_in1k
deit3_small_patch16_224.fb_in1k
vit_huge_patch14_clip_224.laion2b
beit_base_patch16_224.in22k_ft_in22k
efficientvit_l2.r224_in1k
regnety_016.tv2_in1k
vit_base_patch16_clip_224.metaclip_400m
fastvit_s12.apple_in1k
convnextv2_huge.fcmae_ft_in22k_in1k_384
tiny_vit_5m_224.dist_in22k_ft_in1k
gernet_l.idstcv_in1k
pit_s_distilled_224.in1k
mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k
fastvit_ma36.apple_in1k
deit3_base_patch16_224.fb_in1k
volo_d1_224.sail_in1k
vit_base_patch32_clip_224.laion2b_ft_in12k_in1k
botnet26t_256.c1_in1k
convmixer_768_32.in1k
convnext_small.in12k_ft_in1k_384
resnet50.a1h_in1k
vit_so400m_patch14_siglip_384.webli
pvt_v2_b0.in1k
tf_efficientnetv2_xl.in21k_ft_in1k
fastvit_sa12.apple_dist_in1k
mobileone_s0.apple_in1k
fbnetv3_b.ra2_in1k
spnasnet_100.rmsp_in1k
twins_pcpvt_base.in1k
caformer_s36.sail_in22k_ft_in1k_384
mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k
res2net50_26w_6s.in1k
res2net50_14w_8s.in1k
pnasnet5large.tf_in1k
tf_efficientnetv2_b3.in21k
swinv2_base_window12_192.ms_in22k
regnetx_032.tv2_in1k
tiny_vit_21m_224.dist_in22k
res2net101_26w_4s.in1k
mobilevit_xs.cvnets_in1k
swin_s3_tiny_224.ms_in1k
dpn107.mx_in1k
resnext101_32x16d.fb_swsl_ig1b_ft_in1k
selecsls42b.in1k
vit_base_patch16_224.augreg_in1k
inception_v3.gluon_in1k
tf_mixnet_l.in1k
eca_botnext26ts_256.c1_in1k
cait_m36_384.fb_dist_in1k
coat_lite_mini.in1k
nasnetalarge.tf_in1k
beit_large_patch16_512.in22k_ft_in22k_in1k
swin_s3_base_224.ms_in1k
dla102.in1k
eca_halonext26ts.c1_in1k
sebotnet33ts_256.a1h_in1k
convnext_large.fb_in22k
resnetv2_101.a1h_in1k
gmixer_24_224.ra3_in1k
tiny_vit_5m_224.dist_in22k
vit_base_r50_s16_224.orig_in21k
maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k
efficientvit_b0.r224_in1k
convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384
deit3_small_patch16_384.fb_in1k
poolformer_m36.sail_in1k
mobilenetv4_conv_medium.e500_r224_in1k
vit_large_patch16_224.augreg_in21k
efficientnet_el.ra_in1k
rexnet_300.nav_in1k
xcit_large_24_p8_224.fb_in1k
vit_gigantic_patch14_clip_quickgelu_224.metaclip_2pt5b
convnextv2_nano.fcmae_ft_in22k_in1k_384
nest_base_jx.goog_in1k
A NesT image classification model. Trained on ImageNet-1k by paper authors in JAX. Ported to PyTorch by Alexander Soare. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 67.7 - GMACs: 18.0 - Activations (M): 53.4 - Image size: 224 x 224 - Papers: - Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding: https://arxiv.org/abs/2105.12723 - Dataset: ImageNet-1k - Original: https://github.com/google-research/nested-transformer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
edgenext_xx_small.in1k
ViT-SO400M-14-SigLIP2
vit_large_patch16_dinov3_qkvb.sat493m
A DINOv3 ViT image feature encoder. Distilled on SAT-493M from the DINOv3 ViT-7B model. Model Notes The original model weights ended up with all QKV projection biases being zeros. For `timm`, the QKV bias has been disabled (`qkv_bias=False`) for these models and the zero weights are not loaded. For some model sizes there are variants with `qkvb` in the name that keep the bias enabled (`qkv_bias=True`) but zero-valued, to match the behaviour of `transformers` and the original models. The original models keep the RoPE periods as a persistent `bfloat16` buffer, while `timm` generates `float32` periods at init. This results in some small numerical differences; however, the `timm` approach should be less problematic on devices without bfloat16 support and appears to work as well, if not slightly better, for fine-tuning. `model.rope.periods = model.rope.periods.to(torch.bfloat16).to(torch.float32)` will truncate the periods to bfloat16 and produce matching outputs. Model Details - Model Type: Image Feature Encoder - Model Stats: - Params (M): 303.1 - GMACs: 82.4 - Activations (M): 90.6 - Image size: 256 x 256 - Original: https://github.com/facebookresearch/dinov3 - License: DINOv3 - Dataset: SAT-493M - Papers: - DINOv3: https://arxiv.org/abs/2508.10104 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2 - PyTorch Image Models: https://github.com/huggingface/pytorch-image-models Model Comparison See the associated paper for details on the evaluation protocols.
Results for ViT backbones pretrained (or distilled) on web (LVD-1689M)
| Model | IN-ReaL | IN-R | Obj.Net | Ox.-H | ADE20k | NYU↓ | DAVIS | NAVI | SPair |
|-------|---------|------|---------|-------|--------|------|-------|------|-------|
| Global Tasks | | | | | Dense Tasks | | | | |
| DINOv3 ViT-S/16 | 87.0 | 60.4 | 50.9 | 49.5 | 47.0 | 0.403 | 72.7 | 56.3 | 50.4 |
| DINOv3 ViT-S+/16 | 88.0 | 68.8 | 54.6 | 50.0 | 48.8 | 0.399 | 75.5 | 57.1 | 55.2 |
| DINOv3 ViT-B/16 | 89.3 | 76.7 | 64.1 | 58.5 | 51.8 | 0.373 | 77.2 | 58.8 | 57.2 |
| DINOv3 ViT-L/16 | 90.2 | 88.1 | 74.8 | 63.1 | 54.9 | 0.352 | 79.9 | 62.3 | 61.3 |
| DINOv3 ViT-H+/16 | 90.3 | 90.0 | 78.6 | 64.5 | 54.8 | 0.352 | 79.3 | 63.3 | 56.3 |
| DINOv3 ViT-7B/16 | 90.4 | 91.1 | 91.1 | 72.8 | 55.9 | 0.309 | 79.7 | 64.4 | 58.7 |
Results for ConvNeXt backbones distilled on web (LVD-1689M)
| Model | IN-ReaL @256px | IN-ReaL @512px | IN-R @256px | IN-R @512px | Obj.Net @256px | Obj.Net @512px | ADE20k | NYU↓ |
|-------|----------------|----------------|-------------|-------------|----------------|----------------|--------|------|
| Global Tasks | | | | | | | Dense Tasks | |
| DINOv3 ConvNeXt Tiny | 86.6 | 87.7 | 73.7 | 74.1 | 52.6 | 58.7 | 42.7 | 0.448 |
| DINOv3 ConvNeXt Small | 87.9 | 88.7 | 73.7 | 74.1 | 52.6 | 58.7 | 44.8 | 0.432 |
| DINOv3 ConvNeXt Base | 88.5 | 89.2 | 77.2 | 78.2 | 56.2 | 61.3 | 46.3 | 0.420 |
| DINOv3 ConvNeXt Large | 88.9 | 89.4 | 81.3 | 82.4 | 59.3 | 65.2 | 47.8 | 0.403 |
Results for ViT backbones pretrained (or distilled) on satellite (SAT-493M)
| Model | m-BEnet | m-brick-kiln | m-eurosat | m-forestnet | m-pv4ger | m-so2sat | mean |
|-------|---------|--------------|-----------|-------------|----------|----------|------|
| DINOv3 ViT-L/16 | 73.0 | 96.5 | 94.1 | 60.6 | 96.0 | 57.4 | 79.6 |
| DINOv3 ViT-7B/16 | 74.0 | 97.2 | 94.8 | 62.3 | 96.1 | 62.1 | 81.1 |
| Model | m-cashew | m-chesapeake | m-NeonTree | m-nz-cattle | m-pv4ger-seg | m-SA-crop | mean |
|-------|----------|--------------|------------|-------------|--------------|-----------|------|
| DINOv3 ViT-L/16 | 94.2 | 75.6 | 61.8 | 83.7 | 95.2 | 36.8 | 74.5 |
| DINOv3 ViT-7B/16 | 94.1 | 76.6 | 62.6 | 83.4 | 95.5 | 37.6 | 75.0 |
eva02_large_patch14_224.mim_in22k
vgg11_bn.tv_in1k
resnetrs50.tf_in1k
caformer_s18.sail_in22k_ft_in1k_384
tf_efficientnetv2_b2.in1k
vit_base_patch16_clip_224.laion400m_e31
densenet169.tv_in1k
efficientformer_l3.snap_dist_in1k
tf_mobilenetv3_large_100.in1k
mobilenetv4_hybrid_medium.e500_r224_in1k
seresnext26d_32x4d.bt_in1k
tf_efficientnet_b0.in1k
edgenext_base.usi_in1k
eva02_large_patch14_clip_224.merged2b
MobileCLIP2-S0-OpenCLIP
These weights and model card are adapted from the original Apple model at https://huggingface.co/apple/MobileCLIP2-S0. This version uses canonical OpenCLIP configs and weight naming. MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR, August 2025, Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari. This repository contains the MobileCLIP2-S0 checkpoint. `MobileCLIP2-S4` matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max. `MobileCLIP-S3/S4` are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B. Our smallest variant `MobileCLIP-S0` obtains zero-shot performance similar to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller. `MobileCLIP-S2` obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained on 3x fewer seen samples. `MobileCLIP-B (LT)` attains zero-shot ImageNet accuracy of 77.2%, which is significantly better than recent works like DFN and SigLIP with similar architectures, and even OpenAI's ViT-L/14@336.
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|:------|:------------------:|:------------------------:|:------------------------:|:------------------------------:|:-----------------------------:|
| MobileCLIP2-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |
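For completeness, here is a minimal zero-shot classification sketch using the OpenCLIP API; the `hf-hub:` repository id and the prompts are assumptions based on this card's name, so adjust them to wherever the OpenCLIP-format weights are actually hosted.

```python
import torch
from PIL import Image
import open_clip

# NOTE: the repo id below is assumed from the card name; change it if the
# weights live under a different Hugging Face Hub repository.
repo = 'hf-hub:apple/MobileCLIP2-S0-OpenCLIP'
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open('your_image.jpg')).unsqueeze(0)  # any local RGB image
text = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot probabilities over the two prompts
```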
levit_384.fb_dist_in1k
eva02_tiny_patch14_224.mim_in22k
ViT-B-16-SigLIP2-384
convnextv2_base.fcmae_ft_in1k
resnet26d.bt_in1k
xception71.tf_in1k
vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
resnet50x4_clip.openai
pvt_v2_b1.in1k
tf_efficientnet_b6.ns_jft_in1k
An EfficientNet image classification model. Trained on ImageNet-1k and unlabeled JFT-300m using Noisy Student semi-supervised learning in TensorFlow by paper authors, ported to PyTorch by Ross Wightman. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 43.0 - GMACs: 19.4 - Activations (M): 167.4 - Image size: 528 x 528 - Papers: - EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks: https://arxiv.org/abs/1905.11946 - Self-training with Noisy Student improves ImageNet classification: https://arxiv.org/abs/1911.04252 - Dataset: ImageNet-1k - Original: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
swinv2_base_window16_256.ms_in1k
mobilenetv5_300m.gemma3n
efficientnet_em.ra2_in1k
efficientformer_l1.snap_dist_in1k
fastvit_sa12.apple_in1k
vit_large_patch14_clip_224.laion2b_ft_in12k_in1k
nextvit_large.bd_ssld_6m_in1k
tf_efficientnetv2_b1.in1k
convit_tiny.fb_in1k
mobilenetv4_hybrid_large.ix_e600_r384_in1k
swinv2_large_window12to16_192to256.ms_in22k_ft_in1k
convnext_xlarge.fb_in22k_ft_in1k
convnext_nano.in12k
swiftformer_xs.dist_in1k
A SwiftFormer image classification model. Trained on ImageNet-1k by paper authors. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 3.5 - GMACs: 0.6 - Activations (M): 6.4 - Image size: 224 x 224 - Dataset: ImageNet-1k - Papers: - SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications: https://arxiv.org/abs/2303.15446 - Original: https://github.com/Amshaker/SwiftFormer Model Comparison Explore the dataset and runtime metrics of this model in timm model results.