zju-community


efficientloftr

The Efficient LoFTR model was proposed in "Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed" by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou from Zhejiang University. The model efficiently produces semi-dense matches across images, addressing the main limitation of previous detector-free matchers like LoFTR, which showed remarkable matching capability in challenging scenarios but suffered from low efficiency. Efficient LoFTR revisits LoFTR's design choices to improve both efficiency and accuracy.

The abstract from the paper is the following:

"We present a novel method for efficiently producing semi-dense matches across images. The previous detector-free matcher LoFTR has shown remarkable matching capability in handling large viewpoint changes and texture-poor scenarios but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information; therefore, we propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find that spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy. A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement. Our efficiency-optimized model is ~2.5× faster than LoFTR and can even surpass the state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: https://zju3dv.github.io/efficientloftr/"

This model was contributed by stevenbucaille. The original code can be found here.
Efficient LoFTR is a neural network for semi-dense local feature matching across images, building upon and significantly improving the detector-free matcher LoFTR. The key innovations are:

- An aggregated attention mechanism with adaptive token selection for efficient feature transformation. Because neighboring locations share local information, running the transformer over the entire feature map is redundant; this mechanism instead aggregates features for salient tokens and applies vanilla attention with relative positional encoding, significantly reducing the cost of local feature transformation.
- A novel two-stage correlation layer for accurate subpixel correspondence refinement. It first locates pixel-level matches using mutual-nearest-neighbor (MNN) matching on fine feature patches, then refines them to subpixel accuracy by performing correlation and expectation locally within tiny patches, addressing the spatial variance observed in LoFTR's refinement phase.

The model is designed to be highly efficient: its optimized version is approximately 2.5× faster than LoFTR, can surpass efficient sparse matching pipelines like SuperPoint + LightGlue, and achieves higher accuracy than competitive semi-dense matchers. It processes images at a resolution of 640x480 with an optimized running time of 27.0 ms using mixed precision.

- Developed by: ZJU3DV at Zhejiang University
- Model type: Image Matching
- License: Apache 2.0
- Repository: https://github.com/zju3dv/efficientloftr
- Project page: https://zju3dv.github.io/efficientloftr/
- Paper: https://huggingface.co/papers/2403.04765

Efficient LoFTR is designed for large-scale or latency-sensitive applications that require robust image matching. Its direct uses include:

- Image retrieval
- 3D reconstruction
- Homography estimation
- Relative pose recovery
- Visual localization

Here is a quick example of using the model.
Since this model is an image matching model, it requires pairs of images as input. The raw outputs contain the list of keypoints detected by the backbone as well as the list of matches with their corresponding matching scores. You can use the `post_process_keypoint_matching` method from the `LightGlueImageProcessor` to get the keypoints and matches in a readable format, and you can visualize the matches by providing the original images together with the outputs to the processor's visualization method.

Efficient LoFTR is trained end-to-end using a coarse-to-fine matching pipeline on the MegaDepth dataset, a large-scale outdoor dataset.

- Optimizer: AdamW
- Initial Learning Rate: 4×10⁻³
- Batch Size: 16
- Training Hardware: 8 NVIDIA V100 GPUs
- Training Time: Approximately 15 hours

Efficient LoFTR demonstrates significant improvements in efficiency:

- Speed: The optimized model is approximately 2.5× faster than LoFTR and can surpass the efficient sparse matcher LightGlue. For 640x480 resolution image pairs on a single NVIDIA RTX 3090 GPU, the optimized model's processing time is 35.6 ms (FP32) / 27.0 ms (mixed precision).
- Accuracy: The method achieves higher accuracy than competitive semi-dense matchers while running at a significantly higher speed.
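The coarse matching stage described above starts from mutual-nearest-neighbor (MNN) selection over a similarity matrix between the two images' features. A minimal NumPy sketch of just that criterion, with made-up descriptors (not the model's actual features):

```python
import numpy as np

def mutual_nearest_neighbors(desc0, desc1):
    """Return (i, j) pairs where i's best match is j and j's best match is i."""
    sim = desc0 @ desc1.T                  # (N0, N1) similarity matrix
    nn_01 = sim.argmax(axis=1)             # best column j for each row i
    nn_10 = sim.argmax(axis=0)             # best row i for each column j
    # Keep a pair only if the preference is mutual
    mutual = nn_10[nn_01] == np.arange(desc0.shape[0])
    return [(int(i), int(nn_01[i])) for i in np.flatnonzero(mutual)]

# Toy descriptors: rows are L2-normalized feature vectors
rng = np.random.default_rng(0)
desc0 = rng.normal(size=(5, 8))
desc0 /= np.linalg.norm(desc0, axis=1, keepdims=True)
desc1 = desc0[[2, 0, 4]]                   # image 1 "sees" features 2, 0, 4
matches = mutual_nearest_neighbors(desc0, desc1)
print(matches)                             # [(0, 1), (2, 0), (4, 2)]
```

Features 1 and 3 of the first image have no counterpart in the second, and the mutuality check correctly discards them.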

license:apache-2.0

matchanything_eloftr

The MatchAnything-ELOFTR model was proposed in "MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training" by Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, and Xiaowei Zhou from Zhejiang University and Shandong University. This model is a version of ELOFTR enhanced by the MatchAnything pre-training framework. This framework enables the model to achieve universal cross-modality image matching capabilities, overcoming the significant challenge of matching images with drastic appearance changes due to different imaging principles (e.g., thermal vs. visible, CT vs. MRI). This is achieved by pre-training on a massive, diverse dataset synthesized with cross-modal stimulus signals, teaching the model to recognize fundamental, appearance-insensitive structures.

The abstract from the paper is the following:

"Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence (AI) analysis and beyond."

This model was contributed by stevenbucaille. The original code for the MatchAnything project can be found here.

MatchAnything-ELOFTR is a semi-dense feature matcher that has been pre-trained using the novel MatchAnything framework to give it powerful generalization capabilities for cross-modality tasks. The core innovations stem from the training framework, not the model architecture itself, which remains that of ELOFTR. The key innovations of the MatchAnything framework include:

- A multi-resource dataset mixture training engine that combines various data sources to ensure diversity, including multi-view images with 3D reconstructions, large-scale unlabelled video sequences, and vast single-image datasets.
- A cross-modality stimulus data generator that uses image generation techniques (such as style transfer and depth estimation) to create synthetic, pixel-aligned cross-modal training pairs (e.g., visible-to-thermal, visible-to-depth).
- A training process that teaches the model appearance-insensitive, fundamental image structures, allowing a single set of model weights to perform robustly on more than eight different, completely unseen cross-modal matching tasks.
- Developed by: ZJU3DV at Zhejiang University & Shandong University
- Model type: Image Matching
- License: Apache 2.0
- Repository: https://github.com/zju3dv/MatchAnything
- Project page: https://zju3dv.github.io/MatchAnything/
- Paper: https://huggingface.co/papers/2501.07556

MatchAnything-ELOFTR is designed for a vast array of applications requiring robust image matching, especially between different sensor types or imaging modalities. Its direct uses include:

- Medical Image Analysis: Aligning CT-MR, PET-MR, and SPECT-MR scans.
- Histopathology: Registering tissue images with different stains (e.g., H&E and IHC).
- Remote Sensing: Matching satellite/aerial images from different sensors (e.g., Visible-SAR, Thermal-Visible).
- Autonomous Systems: Enhancing localization and navigation for UAVs and autonomous vehicles by matching thermal or visible images to vectorized maps.
- Single-Modality Tasks: The model also retains strong performance on standard single-modality matching, such as retina image registration.

Here is a quick example of using the model for matching a pair of images. Make sure to use transformers from the following commit, as a fix for this model was merged on main but is not yet part of a released version. You can use the `post_process_keypoint_matching` method from the `EfficientLoFTRImageProcessor` to get the keypoints and matches in a readable format, and you can also visualize the matches between the images.

Training Details

MatchAnything-ELOFTR is trained end-to-end using the large-scale, cross-modality pre-training framework.

Training Data

The model was not trained on a single dataset but on a massive collection generated by the Multi-Resources Data Mixture Training framework, totaling approximately 800 million image pairs. This framework leverages:

- Multi-View Images with Geometry: Datasets like MegaDepth, ScanNet++, and BlendedMVS provide realistic viewpoint changes with ground-truth depth.
- Video Sequences: The DL3DV-10k dataset is used, with pseudo ground-truth matches generated between distant frames via a novel coarse-to-fine strategy.
- Single-Image Datasets: Large datasets like GoogleLandmark and SA-1B are used with synthetic homography warping to maximize data diversity.
- Cross-Modality Stimulus Data: A key component where training pairs are augmented by generating synthetic modalities (thermal, nighttime, depth maps) from visible-light images using models like CycleGAN and Depth Anything, encouraging the matcher to learn appearance-invariant features.

Training hyperparameters:

- Optimizer: AdamW
- Initial Learning Rate: 8×10⁻³
- Batch Size: 64
- Training Hardware: 16 NVIDIA A100-80G GPUs
- Training Time: Approximately 4.3 days for the ELOFTR variant

Speeds, Sizes, Times

Since the MatchAnything framework only changes the training process and weights, the model's architecture and running time are identical to the original ELOFTR model.

- Speed: For a 640x480 resolution image pair on a single NVIDIA RTX 3090 GPU, the model takes 40 ms to process.
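The single-image branch above pairs an image with a homography-warped copy of itself, so ground-truth correspondences come for free. A minimal NumPy sketch of that idea; the corner-jitter parameterization and the DLT solve here are illustrative, not the paper's exact pipeline:

```python
import numpy as np

def random_homography(h, w, jitter=0.15, rng=None):
    """Sample a homography by jittering the four image corners, then solve via DLT."""
    if rng is None:
        rng = np.random.default_rng()
    src = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)
    dst = src + rng.uniform(-jitter, jitter, (4, 2)) * [w, h]
    # Direct Linear Transform: each correspondence contributes two equations
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)           # null vector of A, up to scale
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply homography H to (N, 2) xy points in homogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

# Ground-truth correspondences for a synthetic pair come for free:
rng = np.random.default_rng(0)
H = random_homography(480, 640, rng=rng)
pts0 = rng.uniform(0, [640, 480], (100, 2))   # keypoints in image 0
pts1 = warp_points(H, pts0)                   # their exact matches in image 1
```

Warping the second image of the pair with `H` (and optionally restyling it with a cross-modality generator, as the framework does) yields a supervised training pair without any manual annotation.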

license:apache-2.0