
Learning Depth Representation from RGB-D Videos by Time-Aware Contrastive Pre-training



Existing end-to-end depth representation in embodied AI is often task-specific and lacks the benefits of the emerging pre-training paradigm due to limited datasets and training techniques for RGB-D videos. To address the challenge of obtaining a robust and generalized depth representation for embodied AI, we introduce a unified RGB-D video dataset (UniRGBD) and a novel time-aware contrastive (TAC) pre-training approach. UniRGBD addresses the scarcity of large-scale depth pre-training datasets by providing a comprehensive collection of data from diverse sources in a unified format, enabling convenient data loading and accommodating various data domains. We also design an RGB-Depth alignment evaluation procedure and introduce a novel Near-K accuracy metric to assess the scene understanding capability of the depth encoder. Then, the TAC pre-training approach fills the gap in depth pre-training methods suitable for RGB-D videos by leveraging the intrinsic similarity between temporally proximate frames. TAC incorporates a soft label design that acts as valid label noise, enhancing depth semantic extraction and promoting diverse and generalized knowledge acquisition. Furthermore, the adjustments in perspective between temporally proximate frames facilitate the extraction of invariant and comprehensive features, enhancing the robustness of the learned depth representation. Additionally, the inclusion of temporal information stabilizes training gradients and enables spatio-temporal depth perception. A comprehensive evaluation of RGB-Depth alignment demonstrates the superiority of our approach over state-of-the-art methods. We also conduct an uncertainty analysis and a novel zero-shot experiment to validate the robustness and generalization of the TAC approach. Moreover, our TAC pre-training demonstrates significant performance improvements in various embodied AI tasks, providing compelling evidence of its efficacy across diverse domains.
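
As a rough, non-authoritative illustration of the core idea, a time-aware contrastive objective can be viewed as an InfoNCE-style loss whose targets are softened over temporally proximate frames. The sketch below is an assumption about the general form, not the paper's exact formulation (the Gaussian soft-label weighting, temperature, and bandwidth are illustrative); see the training code in this repository for the actual implementation.

import torch
import torch.nn.functional as F

def tac_loss_sketch(rgb_emb, depth_emb, frame_idx, temperature=0.07, sigma=1.0):
    # rgb_emb, depth_emb: (N, D) L2-normalized embeddings of frames drawn from RGB-D videos.
    # frame_idx: (N,) temporal frame indices; frames close in time receive non-zero target weight.
    # Illustrative assumptions: Gaussian soft labels, single video per batch (no cross-video masking).
    logits = rgb_emb @ depth_emb.t() / temperature                      # (N, N) similarity logits
    dt = (frame_idx[:, None] - frame_idx[None, :]).abs().float()        # pairwise temporal distances
    soft_targets = torch.softmax(-dt.pow(2) / (2 * sigma ** 2), dim=1)  # soft labels over near frames
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()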

Setup environment

  1. Use Anaconda to create a Python 3.8 environment:
conda create -n py38 python=3.8
conda activate py38
  2. Install the requirements:
pip install -r requirements.txt

UniRGBD dataset

A unified and universal RGB-D database for depth representation pre-training.


The script for unifying various RGB-D frames to generate UniRGBD is scripts/rgbd_data.ipynb. You can download our pre-processed version (split into several parts due to its large size): [HM3D][SceneNet][SUN3D][TUM, DIODE, NYUv2][Evaluation data (with ScanNet)][Outdoor data (from RGBD1K and DIML)]. The access code is tacp.

Important: HM3D is free for academic, non-commercial research, but requires access from Matterport. After obtaining access and the 3D scenes, you can run scripts/hm3d_data.mp.py to generate RGB-D frames, or download the pre-processed version.

After decompression, the folder structure will look like this (there may be a few redundant folders):

data/rgbd_data/
├── diode_clean_resize
│   └── train
│       ├── indoors
│       └── outdoor
├── hm3d_rgbd
│   └── train
│       ├── 0
│       ├── 1
│       └── ...
├── nyuv2_resize
│   ├── all
│   ├── train
│   └── val
├── pretrain_val
│   ├── diode_val
│   ├── hm3d_val
│   ├── nyuv2_val
│   ├── scannet_val
│   ├── scenenet_val500
│   ├── sun3d_val
│   └── tum_val
├── scenenet_resize
│   └── train
│       ├── 0
│       ├── 1
│       └── ...
├── sun3d
│   └── train
└── tumrgbd_clean_resize
    └── train

Note that all path variables in scripts are absolute, so remember to change them as needed. You can add arbitrary new data by appending the new folder to _C.DATA.RGBD.data_path in config/default.py.
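
For example (a minimal sketch of appending a new folder, assuming data_path is a list of absolute paths; _C is the config node already defined in config/default.py, and the existing entries shown here are placeholders):

# Inside config/default.py
_C.DATA.RGBD.data_path = [
    "/abs/path/to/data/rgbd_data/hm3d_rgbd/train",
    "/abs/path/to/data/rgbd_data/scenenet_resize/train",
    "/abs/path/to/data/rgbd_data/my_new_rgbd_folder",  # newly appended folder (absolute path)
]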

Original data source links: HM3D, SceneNet, SUN3D, TUM, DIODE, NYUv2, ScanNet, RGBD1K and DIML.

Run pre-training

train

train.sh is used for training on a single GPU; multi_proc.sh is used for training on multiple GPUs. The pre-trained weights will be stored in data/checkpoints. All configuration files are in the config folder.

evaluate

eval.sh provides the standard evaluation procedure, including the non-shuffle, block-shuffle, shuffle, and out-of-domain settings. The metric calculations can be found in trainers/dist_trainer.py. The evaluation results will be stored in data/checkpoints/{}/evals.
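
As a sketch of how such retrieval metrics can be computed (this is an assumed reading of the Top-1 and Near-K accuracies over an RGB-to-depth similarity matrix; the reference implementation is in trainers/dist_trainer.py):

import torch

def alignment_accuracy_sketch(rgb_emb, depth_emb, k=0):
    # rgb_emb, depth_emb: (N, D) L2-normalized embeddings; row i of each comes from the same frame.
    # k=0 corresponds to Top-1 accuracy; k>0 gives a Near-K style accuracy where a retrieval counts
    # as correct if the best-matching depth index lies within k positions of the ground-truth index.
    sim = rgb_emb @ depth_emb.t()                               # (N, N) cosine similarities
    best = sim.argmax(dim=1)                                    # best-matching depth frame per RGB frame
    target = torch.arange(rgb_emb.size(0), device=best.device)
    return ((best - target).abs() <= k).float().mean().item()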

check evaluation order

For a fair comparison, we provide the standard evaluation order files here. Run generate_eval_order.sh to check whether your evaluation orders are the same as ours.

Evaluation performance

        Shuffle Top-1    Block-shuffle Near-1    Non-shuffle Near-1    Out-domain Top-1
TAC     0.974            0.642                   0.603                 0.850

Custom usage

scripts/demo.ipynb gives a simple demonstration of encoding a depth image. You can also use the depth encoder separately from the whole model as needed.
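
A minimal sketch of that workflow is shown below. The model class, checkpoint file name, and the depth_encoder attribute are hypothetical placeholders; scripts/demo.ipynb contains the actual loading and preprocessing code.

import numpy as np
import torch
from models.tac_model import TACModel  # hypothetical import path, see scripts/demo.ipynb

model = TACModel()
state = torch.load("data/checkpoints/tac_pretrained.pth", map_location="cpu")  # hypothetical file name
model.load_state_dict(state.get("state_dict", state))
model.eval()

depth = torch.from_numpy(np.load("example_depth.npy")).float()[None, None]  # (1, 1, H, W) depth map
with torch.no_grad():
    feature = model.depth_encoder(depth)  # the depth encoder can be used apart from the full model
print(feature.shape)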

Pretrained weight

[Checkpoint]

Extended experiments

  1. scripts/uncertainty.ipynb: Conduct the MC Dropout uncertainty analysis (a generic sketch of the idea follows this list).
  2. scripts/zero_shot.ipynb: Conduct zero-shot room classification from depth images.
  3. scripts/mae and config/v2/v2_mae.yaml: Train a cross-modal masked autoencoder model.
  4. config/v2/v2_edge.yaml: RGB-Depth alignment by Canny edge detection.
  5. config/v2/v2_tac_outdoortune.yaml: Fine-tune the model with a few outdoor frames.
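
For reference, the MC Dropout idea behind item 1 can be sketched as follows: keep dropout layers active at inference time, run several stochastic forward passes, and treat the spread of the outputs as an uncertainty estimate. The helper below is a generic sketch, not the notebook's code.

import torch

def mc_dropout_sketch(model, x, n_samples=20):
    # Put the model in eval mode but re-enable its dropout layers for stochastic inference.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])  # (n_samples, ...) predictions
    return samples.mean(dim=0), samples.var(dim=0)  # predictive mean and per-element variance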

Embodied experiments

The experiment code is stored here.

Visualization

PointNav


VLN


EQA


Rearrange


Citation

@ARTICLE{10288539,
  author={He, Zongtao and Wang, Liuyi and Dang, Ronghao and Li, Shu and Yan, Qingqing and Liu, Chengju and Chen, Qijun},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Learning Depth Representation From RGB-D Videos by Time-Aware Contrastive Pre-Training}, 
  year={2024},
  volume={34},
  number={6},
  pages={4143-4158},
  doi={10.1109/TCSVT.2023.3326373}}
