This is the PyTorch implementation of our paper:
BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
Wang Zhu*, Hexiang Hu*, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, Fei Sha
2020 Annual Conference of the Association for Computational Linguistics (ACL 2020)
Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better.
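The decompose-then-execute idea in the abstract can be pictured with a toy sketch. This is not the code in this repository: the sentence-based splitting rule, the function names, and the stub agent are all assumptions made for illustration, and the real agent works with learned models rather than strings.

```python
import re
from typing import Callable, List, Tuple

# One memory entry: (sub-instruction, trajectory produced for it).
MemoryEntry = Tuple[str, List[str]]


def split_into_babysteps(instruction: str) -> List[str]:
    """Naively split an instruction into sentences (an assumption, not the paper's method)."""
    parts = re.split(r"(?<=[.!?])\s+", instruction.strip())
    return [p for p in parts if p]


def follow(instruction: str,
           step_fn: Callable[[str, List[MemoryEntry]], List[str]]) -> List[MemoryEntry]:
    """Complete the sub-instructions one by one, passing past steps along as context."""
    memory: List[MemoryEntry] = []
    for sub in split_into_babysteps(instruction):
        trajectory = step_fn(sub, memory)  # the agent sees what it has already done
        memory.append((sub, trajectory))
    return memory


def stub_agent(sub: str, memory: List[MemoryEntry]) -> List[str]:
    """Hypothetical stand-in for a trained agent: returns one made-up node per step."""
    return ["node_{}".format(len(memory))]


if __name__ == "__main__":
    for sub, traj in follow("Walk past the sofa. Turn left. Stop at the door.", stub_agent):
        print(sub, "->", traj)
```

The only point of the sketch is the control flow: each short step is solved in turn, and the growing memory is handed to the next step.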
- Install Python 3.7 (Anaconda recommended: https://www.anaconda.com/distribution/).
- Install PyTorch following the instructions on https://pytorch.org/ (we used PyTorch 1.1.0 in our experiments).
- Download this repository or clone with Git, and then enter the root directory of the repository:
git clone https://github.com/Sha-Lab/babywalk
cd babywalk
- Make sure the packages listed in requirement.txt are installed.
- Download and preprocess the data
chmod +x download.sh
./download.sh
After this step, check that:
  - simulator/resnet_feature/ contains ResNet-152-imagenet.tsv.
  - simulator/ contains total_adj_list.json, which replaces the Matterport3D simulator.
  - src/vocab/vocab_data contains the vocabulary and its GloVe embedding files, train_vocab.txt and train_glove.npy.
  - tasks/ contains R2R, R4R, R6R, R8R, and R2T8, each with a data folder containing training/evaluation data.
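The layout expected after running download.sh can be checked with a short script. The sketch below is not part of the repository; the names EXPECTED and missing_paths are made up for illustration, and it assumes it is run from the repository root. It only tests that the paths exist, not that the files are complete.

```python
from pathlib import Path

# Paths taken from the checklist above, relative to the repository root.
EXPECTED = [
    "simulator/resnet_feature/ResNet-152-imagenet.tsv",
    "simulator/total_adj_list.json",
    "src/vocab/vocab_data/train_vocab.txt",
    "src/vocab/vocab_data/train_glove.npy",
    "tasks/R2R",
    "tasks/R4R",
    "tasks/R6R",
    "tasks/R8R",
    "tasks/R2T8",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing after download:")
        for p in missing:
            print("  " + p)
    else:
        print("All expected files and folders are present.")
```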
Updates: The old link for the ResNet feature has expired. Please see here for the new link and the additional landmark alignment code.
Here we take training on R2R as an example, using BABYWALK.
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
--split_postfix "_landmark" \
--task_name R2R \
--n_iters 50000 \
--model_name "follower_bbw" \
--il_mode "landmark_split" \
--one_by_one \
--one_by_one_mode "landmark" \
--history \
--log_every 100
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
--split_postfix "_landmark" \
--task_name R2R \
--n_iters 30000 \
--curriculum_iters 5000 \
--model_name "follower_bbw_crl" \
--one_by_one \
--one_by_one_mode "landmark" \
--history \
--log_every 100 \
--reward \
--reward_type "cls" \
--batch_size 64 \
--curriculum_rl \
--max_curriculum 4 \
--no_speaker \
--follower_prefix "tasks/R2R/follower/snapshots/follower_bbw_sample_train_iter_30000"
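The second command turns on curriculum-based reinforcement learning with --curriculum_iters 5000 and --max_curriculum 4. This README does not say how the two flags interact. One plausible reading, given here purely as an assumption and not taken from src/train_follower.py, is that training moves up one curriculum level every curriculum_iters iterations until max_curriculum is reached. The function below is a hypothetical sketch of that schedule.

```python
def curriculum_level(iteration, curriculum_iters=5000, max_curriculum=4):
    """Hypothetical schedule: start at level 1 and advance one level
    every `curriculum_iters` iterations, capped at `max_curriculum`."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return min(max_curriculum, 1 + iteration // curriculum_iters)


if __name__ == "__main__":
    # Walk through the 30000 iterations used in the command above.
    for it in range(0, 30000, 5000):
        print(it, curriculum_level(it))
```

Under this reading, the last 15000 of the 30000 iterations would run at the highest level. Check the training script before relying on it.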
Here we take training on R2R as an example, using Speaker-Follower and Reinforced Cross-modal Matching.
- Speaker-Follower
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
--task_name R2R \
--n_iters 50000 \
--model_name "follower_sf_aug" \
--add_augment
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
--task_name R2R \
--n_iters 20000 \
--model_name "follower_sf" \
--follower_prefix "tasks/R2R/follower/snapshots/best_model"
- Reinforced Cross-modal Matching
CUDA_VISIBLE_DEVICES=0 python src/train_follower.py \
--task_name R2R \
--n_iters 20000 \
--model_name "follower_rcm_cls" \
--reward \
--reward_type "cls" \
--batch_size 64 \
--no_speaker \
--follower_prefix "tasks/R2R/follower/snapshots/follower_sf_aug_sample_train-literal_speaker_data_augmentation_iter_50000"
Here we take a model trained on R2R with BABYWALK as an example.
- Evaluate on the validation unseen data of Room 2-to-8.
CUDA_VISIBLE_DEVICES=0 python src/val_follower.py \
--task_name R2T8 \
--split_postfix "_landmark" \
--one_by_one \
--one_by_one_mode "landmark" \
--model_name "follower_bbw" \
--history \
--follower_prefix "tasks/R2R/follower/snapshots/best_model"
- Evaluate on the validation seen / unseen data of RxR (x=2,4,6,8).
  - change --task_name R2T8 to --task_name RxR
- Evaluate on the test data of R2R.
  - set --task_name R2R
  - add --use test
- For SF/RCM models, evaluate on RxR (x=2,4,6,8).
  - set --task_name RxR
  - set --max_steps 5*x and --max_ins_len 50*x
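The RxR flag changes in the list above follow a simple pattern, so they can be generated instead of typed by hand. The helper below is an illustrative sketch, not part of the repository, and its name is hypothetical. It returns only the flags that depend on x. Model-specific flags such as --model_name and --follower_prefix still have to be added as in the evaluation command above.

```python
def rxr_eval_flags(x, baseline=False):
    """Return the x-dependent flags for evaluating on RxR.

    With baseline=True (SF/RCM models) the step and instruction-length
    limits are scaled as 5*x and 50*x, as described in the list above.
    """
    if x not in (2, 4, 6, 8):
        raise ValueError("x must be one of 2, 4, 6, 8")
    flags = ["--task_name", "R{}R".format(x)]
    if baseline:
        flags += ["--max_steps", str(5 * x), "--max_ins_len", str(50 * x)]
    return flags


if __name__ == "__main__":
    for x in (2, 4, 6, 8):
        print(" ".join(rxr_eval_flags(x, baseline=True)))
```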
chmod +x download_model.sh
./download_model.sh
Models trained on R4R
Model | Eval R2R | Eval R4R | Eval R6R | Eval R8R |
---|---|---|---|---|
SF | 14.8 | 9.2 | 5.2 | 5.0 |
RCM(FIDELITY) | 18.3 | 13.7 | 7.9 | 6.1 |
REGRETFUL | 13.4 | 13.5 | 7.5 | 5.6 |
FAST | 14.2 | 15.5 | 7.7 | 6.3 |
BABYWALK | 27.8 | 17.3 | 13.1 | 11.5 |
BABYWALK(COGROUND) | 31.6 | 20.0 | 15.9 | 13.9 |
Models trained on R2R
Model | Eval R2R | Eval R4R | Eval R6R | Eval R8R |
---|---|---|---|---|
SF | 27.2 | 6.7 | 7.2 | 3.8 |
RCM(FIDELITY) | 34.4 | 7.2 | 8.4 | 4.3 |
REGRETFUL | 40.6 | 9.8 | 6.8 | 2.4 |
FAST | 45.4 | 7.2 | 8.5 | 2.4 |
BABYWALK | 36.9 | 13.8 | 11.2 | 9.8 |
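One way to read the second table (models trained on R2R) is to ask how much of each model's R2R score is retained on R8R. The snippet below copies the numbers from that table and computes the ratio. The names are illustrative, and the ratio is only a reading aid, not a metric reported by the authors.

```python
EVAL_SETS = ["R2R", "R4R", "R6R", "R8R"]

# Scores copied from the "Models trained on R2R" table, in EVAL_SETS order.
R2R_TRAINED = {
    "SF": [27.2, 6.7, 7.2, 3.8],
    "RCM(FIDELITY)": [34.4, 7.2, 8.4, 4.3],
    "REGRETFUL": [40.6, 9.8, 6.8, 2.4],
    "FAST": [45.4, 7.2, 8.5, 2.4],
    "BABYWALK": [36.9, 13.8, 11.2, 9.8],
}


def retained_fraction(scores):
    """Ratio of the score on the longest task (R8R) to the score on R2R."""
    return scores[-1] / scores[0]


if __name__ == "__main__":
    for model, scores in R2R_TRAINED.items():
        print("{:<15s} {:.2f}".format(model, retained_fraction(scores)))
```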
Please cite the following BibTeX entry if you use any content from this repository:
@inproceedings{zhu2020babywalk,
title = "{B}aby{W}alk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps",
author = "Zhu, Wang and Hu, Hexiang and Chen, Jiacheng and Deng, Zhiwei and Jain, Vihan and Ie, Eugene and Sha, Fei",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year = "2020",
publisher = "Association for Computational Linguistics",
pages = "2539--2556",
}