Installation | Configuration | Datasets | Visualization | Publications | License
Official PyTorch repository for TRI's latest published depth estimation works. Our goal is to provide a clean environment to reproduce our results and facilitate further research in this field. This repository is an updated version of PackNet-SfM, our previous monocular depth estimation repository, featuring a different license.
(Experimental) For convenient inference, we provide a growing list of our models (PackNet, DeFiNe) over torchhub, with no installation required.
PackNet is a self-supervised monocular depth estimation model. To load a model trained on KITTI and run inference on an RGB image:
import torch
packnet_model = torch.hub.load("TRI-ML/vidar", "PackNet", pretrained=True, trust_repo=True)
rgb_image = torch.rand(1, 3, 192, 640)  # placeholder 1x3xHxW torch.tensor; replace with your RGB image
depth_pred = packnet_model(rgb_image)
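As a rough sketch of how an image might be brought into the batched, channel-first layout used above, the snippet below converts an H x W x 3 uint8 array into a 1 x 3 x H x W float array. The helper name `to_batched_chw` and the scaling to [0, 1] are assumptions made for illustration; the normalization the pretrained model expects is not stated here.

```python
import numpy as np


def to_batched_chw(image_hwc: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 image into a 1x3xHxW float32 array (hypothetical helper)."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB image")
    # Assumption: scale pixel values from [0, 255] to [0, 1].
    scaled = image_hwc.astype(np.float32) / 255.0
    # Move channels first (3xHxW), then add a leading batch dimension.
    return np.transpose(scaled, (2, 0, 1))[np.newaxis, ...]


# Dummy image standing in for a real RGB frame.
dummy = np.zeros((192, 640, 3), dtype=np.uint8)
batch = to_batched_chw(dummy)
print(batch.shape)  # (1, 3, 192, 640)
```

The resulting array can be wrapped with `torch.from_numpy(batch)` before being passed to the model.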
DeFiNe is a multi-view depth estimation model. To load a model trained on ScanNet and run inference on multiple posed RGB images:
import torch
define_model = torch.hub.load("TRI-ML/vidar", "DeFiNe", pretrained=True, trust_repo=True)
frames = {}
frames["rgb"] = ...  # a list of frames as 1x3xHxW torch.tensors
frames["intrinsics"] = ...  # a list of 1x3x3 torch.tensor intrinsics matrices (one for each image)
frames["pose"] = ...  # a batch of 1x4x4 relative poses to reference frame (one will be identity)
depth_preds = define_model(frames) # list of depths, one for each frame
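The intrinsics and pose entries above can be illustrated with a small NumPy sketch. It builds a batched pinhole intrinsics matrix and expresses a set of camera poses relative to a reference camera, so that the reference frame maps to the identity. The helper names, the camera-to-world input convention, and the formula `inv(T_ref) @ T_i` are assumptions for illustration; the pose convention the model actually expects should be checked against the repository.

```python
import numpy as np


def pinhole_intrinsics(fx, fy, cx, cy):
    """Return a 1x3x3 pinhole intrinsics matrix (batch dimension first)."""
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]], dtype=np.float32)
    return K[np.newaxis]


def relative_poses(cam_to_world, ref_index=0):
    """Express every 4x4 camera-to-world pose relative to the reference camera.

    Assumed convention: relative_i = inv(T_ref) @ T_i, so the reference
    frame itself maps to the identity.
    """
    ref_inv = np.linalg.inv(cam_to_world[ref_index])
    return [(ref_inv @ T)[np.newaxis].astype(np.float32) for T in cam_to_world]


# Two hypothetical cameras: the second is shifted 0.5 units along x.
T0 = np.eye(4)
T1 = np.eye(4)
T1[0, 3] = 0.5

poses = relative_poses([T0, T1], ref_index=0)
intrinsics = [pinhole_intrinsics(500.0, 500.0, 320.0, 96.0) for _ in range(2)]
print(poses[0][0])          # identity for the reference frame
print(intrinsics[0].shape)  # (1, 3, 3)
```

Each array can then be converted with `torch.from_numpy` and placed in the `frames` dictionary.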
We recommend using our provided dockerfile (see nvidia-docker2 instructions) to have a reproducible environment. To set up the repository, type in a terminal (only tested in Ubuntu 18.04):
git clone --recurse-submodules https://github.com/TRI-ML/vidar.git # Clone repository with submodules
cd vidar # Move to repository folder
make docker-build # Build the docker image (recommended)
To start our docker container, simply type make docker-interactive. From inside the docker, you can run scripts with the following command pattern:
python3 scripts/run.py <config.yaml> # Single CPU/GPU
python3 scripts/run_ddp.py <config.yaml> # Distributed Data Parallel (DDP) multi-GPU
To verify that the environment is set up correctly, you can run a simple overfit test:
# Download a tiny subset of KITTI
mkdir -p /data/vidar
curl -s https://tri-ml-public.s3.amazonaws.com/github/vidar/datasets/KITTI_tiny.tar | tar xv -C /data/vidar/
# Inside docker
python3 scripts/run.py configs/overfit/kitti_tiny.yaml
Once training is over (which takes around 1 minute), you should achieve results similar to this:
If you want to use features related to AWS (for dataset access) and WandB (for experiment management), you can create associated accounts and configure your shell with the following environment variables:
export AWS_SECRET_ACCESS_KEY=something # AWS secret key
export AWS_ACCESS_KEY_ID=something # AWS access key
export AWS_DEFAULT_REGION=something # AWS default region
export WANDB_ENTITY=something # WANDB entity
export WANDB_API_KEY=something # WANDB API key
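Before launching a run that relies on AWS or WandB, it can help to confirm that these variables are actually set. The following is a minimal sketch using only the standard library; the helper `missing_env_vars` is hypothetical and not part of the repository.

```python
import os

# Variable names taken from the export lines above.
AWS_VARS = ("AWS_SECRET_ACCESS_KEY", "AWS_ACCESS_KEY_ID", "AWS_DEFAULT_REGION")
WANDB_VARS = ("WANDB_ENTITY", "WANDB_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for label, names in (("AWS", AWS_VARS), ("WandB", WANDB_VARS)):
        missing = missing_env_vars(names)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```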
Configuration files (stored in the configs folder) are the entry points for training and inference.
The basic structure of a configuration file is:
wrapper:                             # Training parameters
    <parameters>
arch:                                # Architecture used
    model:                           # Model file and parameters
        file: <model_file>
        <parameters>
    networks:                        # Networks available to the model
        network1:                    # Network1 file and parameters
            file: <network1_file>
            <parameters>
        network2:                    # Network2 file and parameters
            file: <network2_file>
            <parameters>
        ...
    losses:                          # Losses available to the model
        loss1:                       # Loss1 file and parameters
            file: <loss1_file>
            <parameters>
        loss2:                       # Loss2 file and parameters
            file: <loss2_file>
            <parameters>
        ...
evaluation:                          # Evaluation metrics for different tasks
    evaluation1:                     # Evaluation1 and parameters
        <parameters>
    evaluation2:                     # Evaluation2 and parameters
        <parameters>
    ...
optimizers:                          # Optimizers used to train the networks
    network1:                        # Optimizer for network1 and parameters
        <parameters>
    network2:                        # Optimizer for network2 and parameters
        <parameters>
    ...
datasets:                            # Datasets used
    train:                           # Training dataset and parameters
        <parameters>
        augmentation:                # Training augmentations and parameters
            <parameters>
        dataloader:                  # Training dataloader and parameters
            <parameters>
    validation:                      # Validation dataset and parameters
        <parameters>
        augmentation:                # Validation augmentations and parameters
            <parameters>
        dataloader:                  # Validation dataloader and parameters
            <parameters>
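A parsed configuration with this structure can be sanity-checked before a run. The sketch below works on a plain dictionary (for example, the result of loading the YAML file) and reports missing sections. It assumes that model, networks, and losses are nested under arch and that every listed section is required, which may not hold for inference-only configurations; `check_config` is a hypothetical helper, not a function of this repository.

```python
# Top-level sections listed in the structure above.
EXPECTED_SECTIONS = ("wrapper", "arch", "evaluation", "optimizers", "datasets")
# Sub-sections assumed to sit under `arch`.
EXPECTED_ARCH_KEYS = ("model", "networks", "losses")


def check_config(config: dict) -> list:
    """Return human-readable problems found in a parsed configuration dictionary."""
    problems = [f"missing section: {name}" for name in EXPECTED_SECTIONS if name not in config]
    arch = config.get("arch") or {}
    problems += [f"missing arch entry: {name}" for name in EXPECTED_ARCH_KEYS if name not in arch]
    # Each optimizer entry is keyed by the name of the network it trains.
    networks = set(arch.get("networks") or {})
    for name in config.get("optimizers") or {}:
        if name not in networks:
            problems.append(f"optimizer for unknown network: {name}")
    return problems


example = {
    "wrapper": {},
    "arch": {"model": {"file": "<model_file>"}, "networks": {"network1": {}}, "losses": {}},
    "evaluation": {},
    "optimizers": {"network1": {}, "network2": {}},
    "datasets": {},
}
print(check_config(example))  # ['optimizer for unknown network: network2']
```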
To enable WandB logging, you can set these additional parameters in your configuration file:
wandb:
    folder: /data/vidar/wandb        # Where the wandb run is stored
    entity: your_entity              # Wandb entity
    project: your_project            # Wandb project
    num_validation_logs: X           # Number of visualization logs
    tags: [tag1,tag2,...]            # Wandb tags
    notes: note                      # Wandb notes
To enable checkpoint saving, you can set these additional parameters in your configuration file:
checkpoint:
    folder: /data/vidar/checkpoints          # Local folder to store checkpoints
    save_code: True                          # Save repository folder as well
    keep_top: 5                              # How many checkpoints should be stored
    s3_bucket: s3://path/to/s3/bucket        # [optional] AWS folder to store checkpoints
    dataset: [0]                             # [optional] Validation dataset index to track
    monitor: [depth|abs_rel_pp_gt(0)_0]      # [optional] Validation metric to track
    mode: [min]                              # [optional] If the metric is minimized (min) or maximized (max)
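The interplay between keep_top, monitor, and mode can be pictured with a small sketch. It assumes that keep_top retains the best checkpoints according to the monitored metric, which the comments above suggest but do not state outright. The function below is an illustration and not the repository's checkpointing code, and the metric values are made up.

```python
def update_top_checkpoints(kept, name, metric, keep_top=5, mode="min"):
    """Add a checkpoint and return (kept, dropped) after enforcing keep_top.

    kept is a list of (metric, name) pairs; mode says whether lower ("min")
    or higher ("max") metric values are better.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    ranked = sorted(kept + [(metric, name)], key=lambda item: item[0], reverse=(mode == "max"))
    return ranked[:keep_top], [n for _, n in ranked[keep_top:]]


kept = []
for epoch, abs_rel in enumerate([0.120, 0.095, 0.101, 0.088]):
    kept, dropped = update_top_checkpoints(kept, f"epoch_{epoch}.ckpt", abs_rel, keep_top=2, mode="min")
print([name for _, name in kept])  # ['epoch_3.ckpt', 'epoch_1.ckpt']
```

With mode set to max the ranking is reversed, so the checkpoints with the highest metric values are the ones retained.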
To facilitate the reuse of configuration files, we also provide a recipe functionality that enables parameter sharing.
To use a recipe, simply type recipe: <path/to/recipe>|<entry> as an additional parameter, to copy all entries from that recipe onto that section. For example:
wrapper:
    recipe: wrapper|default
will insert all parameters from section default of configs/recipes/wrapper.yaml onto the wrapper section of the configuration file.
Parameters added after the recipe will overwrite those copied over, to facilitate customization.
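The recipe behavior described above can be approximated with a dictionary merge. This is a simplified sketch rather than the repository's implementation: recipes live in an in-memory dictionary instead of YAML files, the parameter names are made up, and every parameter written in the section overrides the recipe regardless of where it appears.

```python
import copy


def apply_recipe(section: dict, recipes: dict) -> dict:
    """Expand a `recipe: <path>|<entry>` key, letting the section's own keys win."""
    if "recipe" not in section:
        return dict(section)
    path, entry = section["recipe"].split("|", 1)
    # Start from a copy of the recipe entry so the shared recipe is not mutated.
    merged = copy.deepcopy(recipes[path][entry])
    # Parameters written in the section overwrite the copied ones.
    merged.update({k: v for k, v in section.items() if k != "recipe"})
    return merged


# In-memory stand-in for configs/recipes/wrapper.yaml; the parameter names are made up.
recipes = {"wrapper": {"default": {"max_epochs": 50, "seed": 42}}}
wrapper = {"recipe": "wrapper|default", "max_epochs": 10}
print(apply_recipe(wrapper, recipes))  # {'max_epochs': 10, 'seed': 42}
```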
In our provided configuration files, datasets are assumed to be downloaded in /data/vidar/<dataset-name>.
For convenience, we provide links to some datasets we commonly use here (all licences still apply):
Dataset | Version | Labels | Splits |
KITTI | KITTI_raw | RGB, Depth, Poses, Intrinsics | Train / Validation / Test |
KITTI | KITTI_tiny | RGB, Depth, Poses, Intrinsics | Train |
DDAD | DDAD_trainval | Depth prediction | Train / Validation |
DDAD | DDAD_tiny | Depth estimation | Train |
DDAD | DDAD_test | Depth estimation | Test |
PD | PD_guda | Depth prediction | Train / Validation |
PD | PD_draft | Depth estimation | Train / Validation |
VKITTI2 | VKITTI2 | Full Virtual KITTI 2 dataset | Train |
VKITTI2 | VKITTI2_tiny | Tiny version of VKITTI2 | Train |
We also provide tools for dataset and prediction visualization, based on our CamViz library.
It is added as a submodule in the externals folder. To use it from inside the docker, run xhost +local: before entering it.
To visualize the information contained in different datasets, after it has been processed to be used by our repository, use the following command:
python3 demos/display_datasets/display_datasets.py <dataset>
Some examples of visualization results you will generate for KITTI and DDAD are shown below (more examples can be found in the demo configuration file demos/display_datasets/config.yaml):
You can move the virtual viewing camera with the mouse, holding the left button to translate, the right button to rotate, and scrolling the wheel to zoom in/out. The up/down arrow keys change between temporal contexts, and the left/right arrow keys change between labels. Pressing SPACE changes between pointcloud color schemes (pixel color or per-camera).
3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020, oral)
Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon
Abstract: Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self, semi, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real-time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.
GT depth | Abs.Rel. | Sq.Rel. | RMSE | RMSElog | SILog | d1.25 | d1.25^2 | d1.25^3 |
ResNet18 | Self-Supervised | 192x640 | ImageNet → KITTI |
Original | 0.116 | 0.811 | 4.902 | 0.198 | 19.259 | 0.865 | 0.957 | 0.981 |
Improved | 0.087 | 0.471 | 3.947 | 0.135 | 12.879 | 0.913 | 0.983 | 0.996 |
PackNet | Self-Supervised | 192x640 | KITTI |
Original | 0.111 | 0.800 | 4.576 | 0.189 | 18.504 | 0.880 | 0.960 | 0.982 |
Improved | 0.078 | 0.420 | 3.485 | 0.121 | 11.725 | 0.931 | 0.986 | 0.996 |
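For readers unfamiliar with the column names, the sketch below computes these metrics using the definitions commonly found in the depth estimation literature. It is a dependency-free illustration over paired lists of valid depth values and is not the repository's evaluation code; steps such as depth caps, cropping, scale alignment, or post-processing are not covered, and the factor of 100 on SILog is an assumption based on common practice.

```python
import math


def depth_metrics(gt, pred):
    """Compute common depth metrics over paired, strictly positive depth values."""
    if len(gt) != len(pred) or len(gt) == 0:
        raise ValueError("gt and pred must be non-empty and of equal length")
    n = len(gt)
    diff = [g - p for g, p in zip(gt, pred)]
    log_diff = [math.log(p) - math.log(g) for g, p in zip(gt, pred)]
    ratio = [max(g / p, p / g) for g, p in zip(gt, pred)]
    mean_log = sum(log_diff) / n
    mean_log_sq = sum(d * d for d in log_diff) / n
    return {
        "abs_rel": sum(abs(d) / g for d, g in zip(diff, gt)) / n,
        "sq_rel": sum(d * d / g for d, g in zip(diff, gt)) / n,
        "rmse": math.sqrt(sum(d * d for d in diff) / n),
        "rmse_log": math.sqrt(mean_log_sq),
        # Scale-invariant log error, multiplied by 100 (assumed convention).
        "silog": math.sqrt(max(mean_log_sq - mean_log ** 2, 0.0)) * 100.0,
        # Fraction of values whose ratio to ground truth is below each threshold.
        "d1.25": sum(r < 1.25 for r in ratio) / n,
        "d1.25^2": sum(r < 1.25 ** 2 for r in ratio) / n,
        "d1.25^3": sum(r < 1.25 ** 3 for r in ratio) / n,
    }


# Made-up depth values in meters, for illustration only.
for name, value in depth_metrics([10.0, 20.0, 40.0], [11.0, 18.0, 43.0]).items():
    print(f"{name}: {value:.3f}")
```

Lower is better for the first five metrics, while the three threshold accuracies are better when higher.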
@inproceedings{tri-packnet,
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  author = {Vitor Guizilini and Rares Ambrus and Sudeep Pillai and Allan Raventos and Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020},
}
Multi-Frame Self-Supervised Depth with Transformers (CVPR 2022)
Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon
Abstract: Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.
GT depth | Frames | Abs.Rel. | Sq.Rel. | RMSE | RMSElog | SILog | d1.25 | d1.25^2 | d1.25^3 |
DepthFormer | Self-Supervised | 192x640 | ImageNet → KITTI |
Original | Single (t) | 0.117 | 0.876 | 4.692 | 0.193 | 18.940 | 0.874 | 0.960 | 0.981 |
Original | Multi (t-1,t) | 0.090 | 0.661 | 4.149 | 0.175 | 17.260 | 0.905 | 0.963 | 0.982 |
Improved | Single (t) | 0.083 | 0.464 | 3.591 | 0.126 | 12.156 | 0.926 | 0.986 | 0.996 |
Improved | Multi (t-1,t) | 0.055 | 0.271 | 2.917 | 0.095 | 9.160 | 0.955 | 0.991 | 0.998 |
@inproceedings{tri-depthformer,
  title = {Multi-Frame Self-Supervised Depth with Transformers},
  author = {Vitor Guizilini and Rares Ambrus and Dian Chen and Sergey Zakharov and Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2022},
}
Full Surround Monodepth from Multiple Cameras (RA-L + ICRA 2022)
Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon
Abstract: Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss masking, we learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner. We also propose a new scale-consistent evaluation metric more suitable to multi-camera settings. Experiments on two challenging benchmarks illustrate the benefits of our approach over strong baselines.
Camera | Abs.Rel. | Sq.Rel. | RMSE | RMSElog | SILog | d1.25 | d1.25^2 | d1.25^3 |
FSM | Self-Supervised | 384x640 | ImageNet → DDAD |
Front | 0.131 | 2.940 | 14.252 | 0.237 | 22.226 | 0.824 | 0.935 | 0.969 |
Front Right | 0.205 | 3.349 | 13.677 | 0.353 | 30.777 | 0.667 | 0.852 | 0.922 |
Back Right | 0.243 | 3.493 | 12.266 | 0.394 | 33.842 | 0.594 | 0.821 | 0.907 |
Back | 0.194 | 3.743 | 16.436 | 0.348 | 29.901 | 0.669 | 0.850 | 0.926 |
Back Left | 0.235 | 3.641 | 13.570 | 0.387 | 31.765 | 0.594 | 0.816 | 0.907 |
Front Left | 0.226 | 3.861 | 12.957 | 0.378 | 32.795 | 0.652 | 0.836 | 0.909 |
@inproceedings{tri-fsm,
  title = {Full Surround Monodepth from Multiple Cameras},
  author = {Vitor Guizilini and Igor Vasiljevic and Rares Ambrus and Greg Shakhnarovich and Adrien Gaidon},
  booktitle = {Robotics and Automation Letters (RA-L)},
  year = {2022},
}
Self-Supervised Camera Self-Calibration from Video (ICRA 2022)
Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew R. Walter
Abstract: Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.
@inproceedings{tri-self_calibration,
  title = {Self-Supervised Camera Self-Calibration from Video},
  author = {Jiading Fang and Igor Vasiljevic and Vitor Guizilini and Rares Ambrus and Greg Shakhnarovich and Adrien Gaidon and Matthew Walter},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2022},
}
Depth Field Networks for Generalizable Multi-view Scene Representation (ECCV 2022)
Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon
Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.
@inproceedings{tri-define,
  title = {Depth Field Networks For Generalizable Multi-view Scene Representation},
  author = {Guizilini, Vitor and Vasiljevic, Igor and Fang, Jiading and Ambrus, Rares and Shakhnarovich, Greg and Walter, Matthew R and Gaidon, Adrien},
  booktitle = {Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXII},
  pages = {245--262},
  year = {2022},
  organization = {Springer}
}
This repository is released under the CC BY-NC 4.0 license.