
Learning Depth Representation from RGB-D Videos by Time-Aware Contrastive Pre-training



Existing end-to-end depth representation in embodied AI is often task-specific and lacks the benefits of the emerging pre-training paradigm due to limited datasets and training techniques for RGB-D videos. To address the challenge of obtaining a robust and generalized depth representation for embodied AI, we introduce a unified RGB-D video dataset (UniRGBD) and a novel time-aware contrastive (TAC) pre-training approach. UniRGBD addresses the scarcity of large-scale depth pre-training datasets by providing a comprehensive collection of data from diverse sources in a unified format, enabling convenient data loading and accommodating various data domains. We also design an RGB-Depth alignment evaluation procedure and introduce a novel Near-K accuracy metric to assess the scene understanding capability of the depth encoder. Then, the TAC pre-training approach fills the gap in depth pre-training methods suitable for RGB-D videos by leveraging the intrinsic similarity between temporally proximate frames. TAC incorporates a soft label design that acts as valid label noise, enhancing depth semantic extraction and promoting diverse and generalized knowledge acquisition. Furthermore, the adjustments in perspective between temporally proximate frames facilitate the extraction of invariant and comprehensive features, enhancing the robustness of the learned depth representation. Additionally, the inclusion of temporal information stabilizes training gradients and enables spatio-temporal depth perception. A comprehensive evaluation of RGB-Depth alignment demonstrates the superiority of our approach over state-of-the-art methods. We also conduct an uncertainty analysis and a novel zero-shot experiment to validate the robustness and generalization of the TAC approach. Moreover, our TAC pre-training demonstrates significant performance improvements in various embodied AI tasks, providing compelling evidence of its efficacy across diverse domains.
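
As a rough, non-authoritative illustration of the core idea, a time-aware contrastive objective can be viewed as an InfoNCE-style loss whose targets are softened over temporally proximate frames. The sketch below is an assumption about the general form, not the paper's exact formulation (the Gaussian soft-label weighting, temperature, and bandwidth are illustrative); see the training code in this repository for the actual implementation.

import torch
import torch.nn.functional as F

def tac_loss_sketch(rgb_emb, depth_emb, frame_idx, temperature=0.07, sigma=1.0):
    # rgb_emb, depth_emb: (N, D) L2-normalized embeddings of frames drawn from RGB-D videos.
    # frame_idx: (N,) temporal frame indices; frames close in time receive non-zero target weight.
    # Illustrative assumptions: Gaussian soft labels, single video per batch (no cross-video masking).
    logits = rgb_emb @ depth_emb.t() / temperature                      # (N, N) similarity logits
    dt = (frame_idx[:, None] - frame_idx[None, :]).abs().float()        # pairwise temporal distances
    soft_targets = torch.softmax(-dt.pow(2) / (2 * sigma ** 2), dim=1)  # soft labels over near frames
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()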

Setup environment

  1. Use Anaconda to create a Python 3.8 environment:
conda create -n py38 python=3.8
conda activate py38
  2. Install the requirements:
pip install -r requirements.txt

UniRGBD dataset

A unified and universal RGB-D database for depth representation pre-training.


The script for unifying various RGB-D frames to generate UniRGBD is scripts/rgbd_data.ipynb. You can download our pre-processed version (split into several parts due to its large size): [HM3D][SceneNet][SUN3D][TUM, DIODE, NYUv2][Evaluation data (with ScanNet)][Outdoor data (from RGBD1K and DIML)]. The access code is tacp.

Important: HM3D is free for academic, non-commercial research, but requires access from Matterport. After obtaining access and the 3D scenes, you can run scripts/hm3d_data.mp.py to generate RGB-D frames, or download the pre-processed version.

After decompression, the folder structure will look like this (there may be a few redundant folders):

data/rgbd_data/
├── diode_clean_resize
│   └── train
│       ├── indoors
│       └── outdoor
├── hm3d_rgbd
│   └── train
│       ├── 0
│       ├── 1
│       └── ...
├── nyuv2_resize
│   ├── all
│   ├── train
│   └── val
├── pretrain_val
│   ├── diode_val
│   ├── hm3d_val
│   ├── nyuv2_val
│   ├── scannet_val
│   ├── scenenet_val500
│   ├── sun3d_val
│   └── tum_val
├── scenenet_resize
│   └── train
│       ├── 0
│       ├── 1
│       └── ...
├── sun3d
│   └── train
└── tumrgbd_clean_resize
    └── train

Note that all path variables in scripts are absolute, so remember to change them as needed. You can add arbitrary new data by appending the new folder to _C.DATA.RGBD.data_path in config/default.py.
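
For example (a minimal sketch of appending a new folder, assuming data_path is a list of absolute paths; _C is the config node already defined in config/default.py, and the existing entries shown here are placeholders):

# Inside config/default.py
_C.DATA.RGBD.data_path = [
    "/abs/path/to/data/rgbd_data/hm3d_rgbd/train",
    "/abs/path/to/data/rgbd_data/scenenet_resize/train",
    "/abs/path/to/data/rgbd_data/my_new_rgbd_folder",  # newly appended folder (absolute path)
]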

Original data source links: HM3D, SceneNet, SUN3D, TUM, DIODE, NYUv2, ScanNet, RGBD1K and DIML.

Run pre-training

train

train.sh is used for training on a single GPU; multi_proc.sh is used for training on multiple GPUs. The pre-trained weights will be stored in data/checkpoints. All configuration files are in the config folder.

evaluate

eval.sh provides the standard evaluation procedure, including the non-shuffle, block-shuffle, shuffle, and out-of-domain settings. The metric calculations can be found in trainers/dist_trainer.py. The evaluation results will be stored in data/checkpoints/{}/evals.
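
As a sketch of how such retrieval metrics can be computed (this is an assumed reading of the Top-1 and Near-K accuracies over an RGB-to-depth similarity matrix; the reference implementation is in trainers/dist_trainer.py):

import torch

def alignment_accuracy_sketch(rgb_emb, depth_emb, k=0):
    # rgb_emb, depth_emb: (N, D) L2-normalized embeddings; row i of each comes from the same frame.
    # k=0 corresponds to Top-1 accuracy; k>0 gives a Near-K style accuracy where a retrieval counts
    # as correct if the best-matching depth index lies within k positions of the ground-truth index.
    sim = rgb_emb @ depth_emb.t()                               # (N, N) cosine similarities
    best = sim.argmax(dim=1)                                    # best-matching depth frame per RGB frame
    target = torch.arange(rgb_emb.size(0), device=best.device)
    return ((best - target).abs() <= k).float().mean().item()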

check evaluation order

For a fair comparison, we provide the standard evaluation order files here. Run generate_eval_order.sh to check whether your evaluation orders are the same as ours.

Evaluation performance

        Shuffle Top-1    Block-shuffle Near-1    Non-shuffle Near-1    Out-domain Top-1
TAC     0.974            0.642                   0.603                 0.850

Custom usage

scripts/demo.ipynb gives a simple demonstration of encoding a depth image. You can also use the depth encoder separately from the whole model as needed.
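
A minimal sketch of that workflow is shown below. The model class, checkpoint file name, and the depth_encoder attribute are hypothetical placeholders; scripts/demo.ipynb contains the actual loading and preprocessing code.

import numpy as np
import torch
from models.tac_model import TACModel  # hypothetical import path, see scripts/demo.ipynb

model = TACModel()
state = torch.load("data/checkpoints/tac_pretrained.pth", map_location="cpu")  # hypothetical file name
model.load_state_dict(state.get("state_dict", state))
model.eval()

depth = torch.from_numpy(np.load("example_depth.npy")).float()[None, None]  # (1, 1, H, W) depth map
with torch.no_grad():
    feature = model.depth_encoder(depth)  # the depth encoder can be used apart from the full model
print(feature.shape)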

Pretrained weight

[Checkpoint]

Extended experiments

  1. scripts/uncertainty.ipynb: Conduct the MC Dropout uncertainty analysis (a generic sketch of the idea follows this list).
  2. scripts/zero_shot.ipynb: Conduct zero-shot room classification from depth images.
  3. scripts/mae and config/v2/v2_mae.yaml: Train a cross-modal masked autoencoder model.
  4. config/v2/v2_edge.yaml: RGB-Depth alignment by Canny edge detection.
  5. config/v2/v2_tac_outdoortune.yaml: Fine-tune the model with a few outdoor frames.
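
For reference, the MC Dropout idea behind item 1 can be sketched as follows: keep dropout layers active at inference time, run several stochastic forward passes, and treat the spread of the outputs as an uncertainty estimate. The helper below is a generic sketch, not the notebook's code.

import torch

def mc_dropout_sketch(model, x, n_samples=20):
    # Put the model in eval mode but re-enable its dropout layers for stochastic inference.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])  # (n_samples, ...) predictions
    return samples.mean(dim=0), samples.var(dim=0)  # predictive mean and per-element variance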

Embodied experiments

The experiment code is stored here.

Visualization

PointNav


VLN


EQA


Rearrange


Citation

@ARTICLE{10288539,
  author={He, Zongtao and Wang, Liuyi and Dang, Ronghao and Li, Shu and Yan, Qingqing and Liu, Chengju and Chen, Qijun},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Learning Depth Representation From RGB-D Videos by Time-Aware Contrastive Pre-Training}, 
  year={2024},
  volume={34},
  number={6},
  pages={4143-4158},
  doi={10.1109/TCSVT.2023.3326373}}
