tue-mps/coco_panoptic_eomt_large_640
EoMT (Encoder-only Mask Transformer) is a Vision Transformer (ViT) architecture designed for high-quality and efficient image segmentation. It was introduced in the CVPR 2025 highlight paper "Your ViT is Secretly an Image Segmentation Model" by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.

> Key Insight: Given sufficient scale and pretraining, a plain ViT with only a few additional parameters can perform segmentation, without the need for task-specific decoders or pixel fusion modules. The same backbone supports semantic, instance, and panoptic segmentation with different post-processing. 🤗

The original implementation can be found in this repository. The HuggingFace model page is available at this link.

Here is how to use this model for Panoptic Segmentation:

Citation

If you find our work useful, please consider citing us as: