depth-anything
Depth-Anything-V2-Base-hf
---
library_name: transformers
license: cc-by-nc-4.0
tags:
- depth
- relative depth
pipeline_tag: depth-estimation
widget:
- inference: false
---
Depth-Anything-V2-Large-hf
---
library_name: transformers
license: cc-by-nc-4.0
tags:
- depth
- relative depth
pipeline_tag: depth-estimation
widget:
- inference: false
---
Depth-Anything-V2-Small-hf
---
license: apache-2.0
tags:
- depth
- relative depth
pipeline_tag: depth-estimation
library_name: transformers
widget:
- inference: false
---
Depth-Anything-V2-Large
---
license: cc-by-nc-4.0
---
Depth-Anything-V2-Metric-Outdoor-Large-hf
Depth Anything V2 (Fine-tuned for Metric Depth Estimation) - Transformers Version

This model is a fine-tuned version of Depth Anything V2 for outdoor metric depth estimation, trained on the synthetic Virtual KITTI dataset. The checkpoint is compatible with the `transformers` library.

Depth Anything V2 was introduced in the paper of the same name by Lihe Yang et al. It uses the same architecture as the original Depth Anything release but employs synthetic data and a larger-capacity teacher model to achieve much finer and more robust depth predictions. This fine-tuned version for metric depth estimation was first released in this repository.

Six metric depth models, at three scales for indoor and outdoor scenes respectively, were released and are available:

| Base Model | Params | Indoor (Hypersim) | Outdoor (Virtual KITTI 2) |
|:-|-:|:-:|:-:|
| Depth-Anything-V2-Small | 24.8M | Model Card | Model Card |
| Depth-Anything-V2-Base | 97.5M | Model Card | Model Card |
| Depth-Anything-V2-Large | 335.3M | Model Card | Model Card |

Depth Anything V2 leverages the DPT architecture with a DINOv2 backbone. The model is trained on ~600K synthetic labeled images and ~62 million real unlabeled images, obtaining state-of-the-art results for both relative and absolute depth estimation.

Depth Anything overview. Taken from the original paper.

You can use the raw model for tasks like zero-shot depth estimation. See the model hub to look for other versions on a task that interests you. Make sure the latest version of `transformers` is installed (or install it from source). You can perform zero-shot depth estimation either through the high-level pipeline or with the model and processor classes directly. For more code examples, please refer to the documentation.
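The two usage paths mentioned above can be sketched with standard `transformers` code. The Hub repo id `depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf` and the example image URL are assumptions, not taken from this card:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation, pipeline

# Hub repo id is an assumption (org name inferred from the model card).
checkpoint = "depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# 1) High-level pipeline: returns a PIL depth map at the input resolution.
pipe = pipeline(task="depth-estimation", model=checkpoint)
result = pipe(image)
depth_map = result["depth"]  # PIL.Image

# 2) Explicit model/processor classes: raw predicted depth tensor.
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_depth = outputs.predicted_depth  # shape (batch, height, width)

# Upsample back to the original image resolution for visualization.
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)
```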
Depth-Anything-V2-Metric-Indoor-Large-hf
Depth Anything V2 (Fine-tuned for Metric Depth Estimation) - Transformers Version

This model is a fine-tuned version of Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset. The checkpoint is compatible with the `transformers` library.

Depth Anything V2 was introduced in the paper of the same name by Lihe Yang et al. It uses the same architecture as the original Depth Anything release but employs synthetic data and a larger-capacity teacher model to achieve much finer and more robust depth predictions. This fine-tuned version for metric depth estimation was first released in this repository.

Six metric depth models, at three scales for indoor and outdoor scenes respectively, were released and are available:

| Base Model | Params | Indoor (Hypersim) | Outdoor (Virtual KITTI 2) |
|:-|-:|:-:|:-:|
| Depth-Anything-V2-Small | 24.8M | Model Card | Model Card |
| Depth-Anything-V2-Base | 97.5M | Model Card | Model Card |
| Depth-Anything-V2-Large | 335.3M | Model Card | Model Card |

Depth Anything V2 leverages the DPT architecture with a DINOv2 backbone. The model is trained on ~600K synthetic labeled images and ~62 million real unlabeled images, obtaining state-of-the-art results for both relative and absolute depth estimation.

Depth Anything overview. Taken from the original paper.

You can use the raw model for tasks like zero-shot depth estimation. See the model hub to look for other versions on a task that interests you. Make sure the latest version of `transformers` is installed (or install it from source). You can perform zero-shot depth estimation either through the high-level pipeline or with the model and processor classes directly. For more code examples, please refer to the documentation.
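Both usage paths mentioned above can be sketched with standard `transformers` code. The Hub repo id `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf` and the example image URL are assumptions, not taken from this card:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation, pipeline

# Hub repo id is an assumption (org name inferred from the model card).
checkpoint = "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# 1) High-level pipeline: returns a PIL depth map at the input resolution.
pipe = pipeline(task="depth-estimation", model=checkpoint)
result = pipe(image)
depth_map = result["depth"]  # PIL.Image

# 2) Explicit model/processor classes: raw predicted depth tensor.
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_depth = outputs.predicted_depth  # shape (batch, height, width)
```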
Depth-Anything-V2-Base
Depth-Anything-V2-Small
Depth-Anything-V2-Metric-Indoor-Base-hf
Depth-Anything-V2-Metric-Outdoor-Base-hf
Depth-Anything-V2-Metric-Indoor-Small-hf
DA3-LARGE
prompt-depth-anything-vits-hf
Depth-Anything-V2-Metric-Outdoor-Small-hf
DA3-BASE
prompt-depth-anything-vitl-hf
prompt-depth-anything-vits-transparent-hf
Video-Depth-Anything-Large
Trace Anything
Trace Anything: Representing Any Video in 4D via Trajectory Fields

This repository contains the official implementation of the paper Trace Anything: Representing Any Video in 4D via Trajectory Fields. Trace Anything proposes a novel approach that represents any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. The model predicts the entire trajectory field in a single feed-forward pass, enabling applications like goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

Project Page: https://trace-anything.github.io/
Code: https://github.com/ByteDance-Seed/TraceAnything

Installation: for detailed installation instructions, please refer to the GitHub repository.

Sample usage: to run inference with the Trace Anything model, first download the pretrained weights (see GitHub for details), then use the provided script. Results, including 3D control points and confidence maps, will be saved to ` / /output.pt`.

An interactive 3D viewer is available to explore the generated trajectory fields. For more options and remote usage, check the GitHub repository.

Citation: if you find this work useful, please consider citing the paper.
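A rough sketch of the inference and viewer steps described above. The script names, flags, and output layout below are assumptions for illustration only; the actual commands are documented in the GitHub repository:

```shell
# Clone the official implementation (URL from the model card).
git clone https://github.com/ByteDance-Seed/TraceAnything.git
cd TraceAnything

# Download the pretrained weights first (see the GitHub README for links).
# Script name and flags below are hypothetical placeholders.
python infer.py --video my_video.mp4 --output-dir outputs/

# Launch the interactive 3D viewer on the saved trajectory field
# (viewer entry point is likewise a placeholder).
python viewer.py --result outputs/my_video/output.pt
```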
Video-Depth-Anything-Small
Depth-Anything-V2-Metric-VKITTI-Large
Depth-Anything-V2-Metric-Hypersim-Large
camera-depth-model-d405
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Model (CDM) introduced in the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots.

Project page: https://manipulation-as-in-simulation.github.io/
Code repository: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

Abstract: Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
To run depth inference on RGB-D camera data, follow the steps from the CDM inference guide in the GitHub repository: first, clone the repository and install the CDM package; then, navigate to the `cdm` directory and run inference using `infer.py`.

Citation: if you use this work in your research, please cite the paper.
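The steps above can be sketched as follows. The repository URL comes from the model card; the `infer.py` flag names and checkpoint filename are illustrative assumptions, so consult the CDM inference guide for the actual interface:

```shell
# Clone the suite and install the CDM package.
git clone https://github.com/ByteDance-Seed/manip-as-in-sim-suite.git
cd manip-as-in-sim-suite/cdm
pip install -e .

# Run inference on a paired RGB + raw-depth capture from a RealSense D405.
# Flag names and checkpoint filename are hypothetical placeholders;
# check `python infer.py --help` for the real options.
python infer.py \
  --checkpoint /path/to/camera-depth-model-d405.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```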
camera-depth-model-kinect
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Models (CDMs) presented in the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots. CDMs are proposed as simple plugins for daily-use depth cameras. They take RGB images and raw depth signals as input and output denoised, accurate metric depth. This approach addresses challenges in using depth cameras for robotic manipulation, such as limited accuracy and noise susceptibility, effectively bridging the sim-to-real gap for manipulation tasks.

Abstract: Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks.
Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.

Project Page: https://manipulation-as-in-simulation.github.io
Code: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

The full code, additional details, and further instructions can be found in the official GitHub repository. For specific model inference instructions, refer to the CDM inference guide on GitHub.

Sample usage: to run depth inference on RGB-D camera data using CDM, follow the example from the repository.
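A sketch of that inference call for the Kinect model. The repository URL comes from the card; the `infer.py` flags and checkpoint filename are illustrative assumptions, so see the CDM inference guide for the actual interface:

```shell
# Install the CDM package from the suite repository.
git clone https://github.com/ByteDance-Seed/manip-as-in-sim-suite.git
cd manip-as-in-sim-suite/cdm
pip install -e .

# Run inference with the Kinect checkpoint; flag names and filename
# are hypothetical placeholders -- check `python infer.py --help`.
python infer.py \
  --checkpoint /path/to/camera-depth-model-kinect.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```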
camera-depth-model-zed2i-neural
This repository contains the Camera Depth Model (CDM) from the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots. Camera Depth Models (CDMs) are proposed as a simple plugin on daily-use depth cameras: they take RGB images and raw depth signals as input and output denoised, accurate metric depth. This enables accurate geometry perception in robots by effectively bridging the sim-to-real gap for manipulation tasks.

Project page: https://manipulation-as-in-simulation.github.io/
Code: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

To run depth inference on RGB-D camera data, use the `infer.py` script provided in the `cdm` directory of the main repository.
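A sketch of that `infer.py` invocation for the ZED 2i (neural mode) model. The flag names and checkpoint filename are illustrative assumptions; the actual options are documented in the repository:

```shell
cd manip-as-in-sim-suite/cdm

# Flags and checkpoint filename are hypothetical placeholders --
# check `python infer.py --help` for the real interface.
python infer.py \
  --checkpoint /path/to/camera-depth-model-zed2i-neural.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```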
camera-depth-model-zed2i-quality
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Models (CDMs) presented in the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots.

Project Page: https://manipulation-as-in-simulation.github.io/
Code: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

Abstract: Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
Key Features

- Sim-to-Real Depth Transfer: clean, metric depth estimation that matches simulation quality
- Multi-Camera Support: pre-trained models for various depth sensors (RealSense, ZED, Kinect)
- Automated Data Generation: scalable demonstration generation using enhanced MimicGen
- Whole-Body Control: unified control for mobile manipulators in the MimicGen pipeline
- Multi-GPU Parallelization: distributed simulation for faster data collection
- VR Teleoperation: intuitive demonstration recording using Meta Quest controllers

This section provides an example of how to run depth inference using the Camera Depth Model (CDM). For more details, refer to the model inference guide in the GitHub repository. To run depth inference on RGB-D camera data, use the `infer.py` script in the repository's `cdm` directory.

Citation: if you use this work in your research, please cite the paper.
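That inference command can be sketched as follows for the ZED 2i (quality mode) model. The flag names and checkpoint filename are illustrative assumptions; the model inference guide in the repository documents the actual options:

```shell
cd manip-as-in-sim-suite/cdm

# Flags and checkpoint filename are hypothetical placeholders --
# check `python infer.py --help` for the real interface.
python infer.py \
  --checkpoint /path/to/camera-depth-model-zed2i-quality.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```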
camera-depth-model-l515
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Models (CDMs) from the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots. Camera Depth Models (CDMs) are proposed as a simple plugin for daily-use depth cameras. They take RGB images and raw depth signals as input and output denoised, accurate metric depth. This enables policies trained purely in simulation to transfer directly to real robots by providing nearly simulation-level accurate depth perception.

Project Page: https://manipulation-as-in-simulation.github.io
Code Repository: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

For detailed installation instructions and further usage examples, please refer to the CDM documentation in the GitHub repository. To run depth inference on RGB-D camera data, use the `infer.py` script in the repository's `cdm` directory. If you use this work in your research, please cite the paper.
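A sketch of that inference command for the RealSense L515 model. The flag names and checkpoint filename are illustrative assumptions; see the CDM documentation for the actual interface:

```shell
cd manip-as-in-sim-suite/cdm

# Flags and checkpoint filename are hypothetical placeholders --
# check `python infer.py --help` for the real interface.
python infer.py \
  --checkpoint /path/to/camera-depth-model-l515.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```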
Depth-Anything-V2-Metric-Hypersim-Small
Depth-Anything-V2-Metric-VKITTI-Small
prompt-depth-anything-vitl
camera-depth-model-d435
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Models (CDMs) from the paper Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots. CDMs are proposed as a simple plugin for daily-use depth cameras, taking RGB images and raw depth signals as input and outputting denoised, accurate metric depth. This enables policies trained purely in simulation to transfer directly to real robots, effectively bridging the sim-to-real gap for manipulation tasks.

Project page: https://manipulation-as-in-simulation.github.io/
Code repository: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

To run depth inference on RGB-D camera data, follow the example from the GitHub repository's CDM section.
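That example can be sketched as follows for the RealSense D435 model. The repository URL comes from the card; the `infer.py` flags and checkpoint filename are illustrative assumptions, so consult the repository's CDM section for the actual commands:

```shell
# Clone the suite and install the CDM package.
git clone https://github.com/ByteDance-Seed/manip-as-in-sim-suite.git
cd manip-as-in-sim-suite/cdm
pip install -e .

# Flags and checkpoint filename are hypothetical placeholders --
# check `python infer.py --help` for the real interface.
python infer.py \
  --checkpoint /path/to/camera-depth-model-d435.ckpt \
  --rgb rgb.png \
  --depth raw_depth.png \
  --output refined_depth.png
```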