
Multimodal Evolutionary Encoder for Continuous Vision-Language Navigation

[Project page]

Setup

  1. Use anaconda to create a Python 3.8 environment:
conda create -n habitat python=3.8
conda activate habitat
  2. Install Habitat-Sim 0.2.1:
conda install -c aihabitat -c conda-forge habitat-sim=0.2.1 headless
  3. Install Habitat-Lab 0.2.1:
git clone --branch v0.2.1 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all
  4. Clone this repository and install Python requirements:
git clone https://github.com/RavenKiller/MEE.git
cd MEE
pip install -r requirements.txt
  5. Download Matterport3D scenes:
    # requires running with python 2.7
    python download_mp.py --task habitat -o data/scene_datasets/mp3d/
    • Extract such that it has the form data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes.
  6. Download pre-processed episodes from here. Extract them into data/datasets/.
  7. Download the depth encoder from here. Extract the model to data/ddppo-models/gibson-4plus-resnet50.pth.
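After completing these steps, the data layout can be sanity-checked with a small script like the sketch below. The paths come from the steps above; the `*.json.gz` episode file extension is an assumption and may differ in the released archives.

```python
# Minimal sanity check of the expected data layout (paths from the setup steps above).
# The episode file extension (*.json.gz) is an assumption.
from pathlib import Path

def check_data_layout(root="data"):
    root = Path(root)
    scenes = list((root / "scene_datasets" / "mp3d").glob("*/*.glb"))
    print(f"MP3D scenes found: {len(scenes)} (expected 90)")
    depth_encoder = root / "ddppo-models" / "gibson-4plus-resnet50.pth"
    print(f"Depth encoder present: {depth_encoder.exists()}")
    episodes = list((root / "datasets").rglob("*.json.gz"))
    print(f"Episode files found: {len(episodes)}")

if __name__ == "__main__":
    check_data_layout()
```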

Evolutionary pre-training dataset

We propose an evolutionary pre-training strategy in this work and develop the corresponding datasets. The data collection scripts are stored in scripts/ with filenames like evo_data_stage1.ipynb.

V1

The v1 version (default access code: evop) contains a total of 4.8M samples across all modalities. All data are organized in HDF5 format; the total size after decompression is around 720 GB. Below is the file list (an inspection sketch follows it):

  • stage1.zip
    • rgb.mat: contains RGB data with shape (395439, 224, 224, 3)
    • depth.mat: contains depth data with shape (417900, 256, 256, 1)
    • inst.mat: contains instruction data with shape (400250, 77), tokenized and zero-padded
    • sub.mat: contains sub-instruction data with shape (410357, 12, 77)
  • stage2.zip
    • rgb_depth_large.mat: contains aligned RGB and depth data, a total of 230766 pairs
    • inst_sub_large.mat: contains aligned instruction and sub-instruction data, a total of 157877 pairs
    • rgb_depth.mat: contains a small debug version
    • inst_sub.mat: contains a small debug version
  • stage3.zip
    • data.mat: contains aligned (RGB, depth, instruction, sub-instruction), a total of 601038 tuples
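The HDF5 files can be inspected with h5py, as in the minimal sketch below. The extraction path and the dataset key names are assumptions; print the keys of your local copy to find the actual names.

```python
# A minimal sketch for inspecting one of the stage-1 HDF5 files with h5py.
# The file path and the chosen dataset key are assumptions.
import h5py

with h5py.File("data/evo_data/stage1/rgb.mat", "r") as f:
    keys = list(f.keys())
    print("datasets:", keys)                              # discover the actual names
    key = keys[0]
    print("shape:", f[key].shape, "dtype:", f[key].dtype)  # e.g. (395439, 224, 224, 3)
    sample = f[key][0]                                     # lazily read a single RGB frame
```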

The data sources include:

V2

The v2 version contains a total of 83.9M samples across all modalities and is a superset of v1. All data are stored as separate files (RGB: JPEG, depth: PNG, instruction: TXT, sub-instruction: TXT); a loading sketch is given below. The collection and loading scripts are developed in the dev branch.

Additional data sources: ImageNet, LAION-HighResolution, CC-12M, C4, HM3D, SUN3D, ScanNet, Marky-gibson.

Access to several datasets is subject to specific terms and conditions (e.g., HM3D). Please request access before using them.
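Since the v2 data are stored as per-sample files, loading reduces to pairing files across modality folders. Below is a minimal sketch assuming one folder per modality with matching file stems; the actual directory structure is defined by the collection scripts in the dev branch and may differ.

```python
# A minimal loading sketch for the v2 per-file layout.
# Assumption: one folder per modality (rgb/, depth/, inst/, sub/) with matching
# file stems; adjust to the layout produced by the dev-branch scripts.
from pathlib import Path
from PIL import Image

def load_sample(root, stem):
    root = Path(root)
    rgb = Image.open(root / "rgb" / f"{stem}.jpg")       # RGB: JPEG
    depth = Image.open(root / "depth" / f"{stem}.png")   # depth: PNG
    inst = (root / "inst" / f"{stem}.txt").read_text()   # instruction: TXT
    sub = (root / "sub" / f"{stem}.txt").read_text()     # sub-instruction: TXT
    return rgb, depth, inst, sub
```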

Train, evaluate and test

run.py is the program entry point. Run it like this:

python run.py \
  --exp-config {config} \
  --run-type {type}

{config} should be replaced with a config file path; {type} should be train, eval, or inference, corresponding to training, evaluating, and testing the model, respectively.

Our config files are stored in evoenc/config/:

| File | Meaning |
| --- | --- |
| evoenc.yaml | Training the model with behavior cloning |
| evoenc_da.yaml | Training the model with DAgger |
| evoenc_aug.yaml | Training the model with EnvDrop |
| evoenc_p{x}.yaml | Evolutionary pre-training stage {x}+1 |
| evoenc_p{x}_tune.yaml | Task fine-tuning with DAgger |

Several paths (such as the pre-training data folder and checkpoint paths) are configured in the above YAML files or in evoenc/config/default.py. Remember to change them as needed.

Pre-trained weights

[stage 1] [stage 2] [stage 3]

We release the encoder weights obtained after evolutionary pre-training. The frozen pre-extractor is excluded from these weights to reduce storage cost. Refer to evoenc/models/evoenc_policy.py for loading the pre-trained weights; a minimal sketch follows.
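For reference, loading typically amounts to reading the checkpoint and copying the matching parameters into the policy, as sketched below. The "state_dict" key and the use of strict=False (to tolerate the excluded pre-extractor) are assumptions; evoenc/models/evoenc_policy.py remains the authoritative reference.

```python
# A minimal sketch of loading the released encoder weights into a policy.
# Assumptions: parameters are stored under "state_dict" (falling back to the
# raw dict), and strict=False skips the excluded frozen pre-extractor weights.
import torch

def load_pretrained_encoder(policy, ckpt_path, device="cpu"):
    ckpt = torch.load(ckpt_path, map_location=device)
    state_dict = ckpt.get("state_dict", ckpt)
    missing, unexpected = policy.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return policy
```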

Visualization

Unified feature spaces

[Figure: unified feature spaces]

Evolved encoder performance

[Figure: evolved encoder performance]

Comparison with the baseline:

navigation_vlnce.mp4

Failure cases

Premature stop

premature_stop.mp4

Wrong exploration

wrong_exploration.mp4

Deadlock

deadlock.mp4

Real scene navigation

Alkaid Robot

[Figure: Alkaid robot]

Alkaid is a self-developed interactive service robot. Here are some parameters:

  • Camera: 720p resolution, 90° max FOV
  • Screen: 1080p touch screen
  • Microphone: 4-microphone circular array, 61 dB SNR
  • Speaker: 2 stereo units, 150 Hz–20 kHz output
  • Chassis: 2-wheel differential drive, 0.5 m/s max speed, 1.2 rad/s max angular speed

Demonstration

Currently, we release 13 paths in the VLN-CE format. The video below demonstrates 4 of them.

navigation_vlntj.mp4
