
Multimodal Evolutionary Encoder for Continuous Vision-Language Navigation

[Project page]

Setup

  1. Use anaconda to create a Python 3.8 environment:
conda create -n habitat python=3.8
conda activate habitat
  2. Install Habitat-Sim 0.2.1:
conda install -c aihabitat -c conda-forge habitat-sim=0.2.1 headless
  3. Install Habitat-Lab 0.2.1:
git clone --branch v0.2.1 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all
  4. Clone this repository and install Python requirements:
git clone https://github.com/RavenKiller/MEE.git
cd MEE
pip install -r requirements.txt
  5. Download Matterport3D scenes:
    # requires running with python 2.7
    python download_mp.py --task habitat -o data/scene_datasets/mp3d/
    • Extract such that it has the form data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes.
  6. Download pre-processed episodes from here. Extract them into data/datasets/.
  7. Download the depth encoder from here. Extract the model to data/ddppo-models/gibson-4plus-resnet50.pth.
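After completing these steps, the data layout can be sanity-checked with a small script like the sketch below. The paths come from the steps above; the `*.json.gz` episode file extension is an assumption and may differ in the released archives.

```python
# Minimal sanity check of the expected data layout (paths from the setup steps above).
# The episode file extension (*.json.gz) is an assumption.
from pathlib import Path

def check_data_layout(root="data"):
    root = Path(root)
    scenes = list((root / "scene_datasets" / "mp3d").glob("*/*.glb"))
    print(f"MP3D scenes found: {len(scenes)} (expected 90)")
    depth_encoder = root / "ddppo-models" / "gibson-4plus-resnet50.pth"
    print(f"Depth encoder present: {depth_encoder.exists()}")
    episodes = list((root / "datasets").rglob("*.json.gz"))
    print(f"Episode files found: {len(episodes)}")

if __name__ == "__main__":
    check_data_layout()
```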

Evolutionary pre-training dataset

We propose an evolutionary pre-training strategy in this work and develop the corresponding datasets. The data collection scripts are stored in scripts/ with filenames like evo_data_stage1.ipynb.

V1

The v1 version (default access code: evop) contains a total of 4.8M samples across all modalities. All data are organized in HDF5 format; the total size after decompression is around 720 GB. Below is the file list (an inspection sketch follows it):

  • stage1.zip
    • rgb.mat: contains RGB data with shape (395439, 224, 224, 3)
    • depth.mat: contains depth data with shape (417900, 256, 256, 1)
    • inst.mat: contains instruction data with shape (400250, 77), tokenized and zero-padded
    • sub.mat: contains sub-instruction data with shape (410357, 12, 77)
  • stage2.zip
    • rgb_depth_large.mat: contains aligned RGB and depth data, a total of 230766 pairs
    • inst_sub_large.mat: contains aligned instruction and sub-instruction data, a total of 157877 pairs
    • rgb_depth.mat: contains a small debug version
    • inst_sub.mat: contains a small debug version
  • stage3.zip
    • data.mat: contains aligned (RGB, depth, instruction, sub-instruction), a total of 601038 tuples
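The HDF5 files can be inspected with h5py, as in the minimal sketch below. The extraction path and the dataset key names are assumptions; print the keys of your local copy to find the actual names.

```python
# A minimal sketch for inspecting one of the stage-1 HDF5 files with h5py.
# The file path and the chosen dataset key are assumptions.
import h5py

with h5py.File("data/evo_data/stage1/rgb.mat", "r") as f:
    keys = list(f.keys())
    print("datasets:", keys)                              # discover the actual names
    key = keys[0]
    print("shape:", f[key].shape, "dtype:", f[key].dtype)  # e.g. (395439, 224, 224, 3)
    sample = f[key][0]                                     # lazily read a single RGB frame
```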

The data sources include:

V2

The v2 version contains a total of 83.9M samples across all modalities and is a superset of v1. All data are stored as separate files (RGB: JPEG, depth: PNG, instruction: TXT, sub-instruction: TXT); a loading sketch is given below. The collection and loading scripts are developed in the dev branch.

Additional data sources: ImageNet, LAION-HighResolution, CC-12M, C4, HM3D, SUN3D, ScanNet, Marky-gibson.

Access to several datasets is subject to specific terms and conditions (e.g., HM3D). Please request access before using them.
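Since the v2 data are stored as per-sample files, loading reduces to pairing files across modality folders. Below is a minimal sketch assuming one folder per modality with matching file stems; the actual directory structure is defined by the collection scripts in the dev branch and may differ.

```python
# A minimal loading sketch for the v2 per-file layout.
# Assumption: one folder per modality (rgb/, depth/, inst/, sub/) with matching
# file stems; adjust to the layout produced by the dev-branch scripts.
from pathlib import Path
from PIL import Image

def load_sample(root, stem):
    root = Path(root)
    rgb = Image.open(root / "rgb" / f"{stem}.jpg")       # RGB: JPEG
    depth = Image.open(root / "depth" / f"{stem}.png")   # depth: PNG
    inst = (root / "inst" / f"{stem}.txt").read_text()   # instruction: TXT
    sub = (root / "sub" / f"{stem}.txt").read_text()     # sub-instruction: TXT
    return rgb, depth, inst, sub
```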

Train, evaluate and test

run.py is the program entry point. Run it like this:

python run.py \
  --exp-config {config} \
  --run-type {type}

{config} should be replaced with a config file path; {type} should be train, eval, or inference, corresponding to training, evaluating, and testing the model, respectively.

Our config files are stored in evoenc/config/:

| File | Meaning |
| --- | --- |
| evoenc.yaml | Training the model with behavior cloning |
| evoenc_da.yaml | Training the model with DAgger |
| evoenc_aug.yaml | Training the model with EnvDrop |
| evoenc_p{x}.yaml | Evolutionary pre-training stage {x}+1 |
| evoenc_p{x}_tune.yaml | Task fine-tuning with DAgger |

Several paths (such as the pre-training data folder and checkpoint paths) are configured in the above YAML files or in evoenc/config/default.py. Remember to change them as needed.

Pre-trained weights

[stage 1] [stage 2] [stage 3]

We release the encoder weights obtained after evolutionary pre-training. The frozen pre-extractor is excluded from these weights to reduce storage cost. Refer to evoenc/models/evoenc_policy.py for loading the pre-trained weights; a minimal sketch follows.
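For reference, loading typically amounts to reading the checkpoint and copying the matching parameters into the policy, as sketched below. The "state_dict" key and the use of strict=False (to tolerate the excluded pre-extractor) are assumptions; evoenc/models/evoenc_policy.py remains the authoritative reference.

```python
# A minimal sketch of loading the released encoder weights into a policy.
# Assumptions: parameters are stored under "state_dict" (falling back to the
# raw dict), and strict=False skips the excluded frozen pre-extractor weights.
import torch

def load_pretrained_encoder(policy, ckpt_path, device="cpu"):
    ckpt = torch.load(ckpt_path, map_location=device)
    state_dict = ckpt.get("state_dict", ckpt)
    missing, unexpected = policy.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return policy
```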

Visualization

Unified feature spaces

[Figure: unified feature spaces]

Evolved encoder performance

[Figure: evolved encoder performance]

Comparison with the baseline:

navigation_vlnce.mp4

Failure cases

Premature stop

premature_stop.mp4

Wrong exploration

wrong_exploration.mp4

Deadlock

deadlock.mp4

Real scene navigation

Alkaid Robot

[Figure: Alkaid robot]

Alkaid is a self-developed interactive service robot. Here are some parameters:

  • Camera: 720p resolution, 90° max FOV
  • Screen: 1080p touch screen
  • Microphone: 4-microphone circular array, 61 dB SNR
  • Speaker: 2 stereo units, 150 Hz–20 kHz output
  • Chassis: 2-wheel differential drive, 0.5 m/s max speed, 1.2 rad/s max angular speed

Demonstration

Currently, we release 13 paths in the VLN-CE format. The video below demonstrates 4 of them.

navigation_vlntj.mp4
