[CoRL24] Contrastive λ-Repformer: Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representation
- Accepted at CoRL 2024
- Project Page
- arXiv
- Dataset
- Project Video
In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive λ-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive λ-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.
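The repository contains the full implementation; purely for intuition, the toy sketch below shows how a difference of multi-level representations between the pre- and post-manipulation images can drive a binary success classifier. The dimensions, the concatenation-based fusion, and the classifier head are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiLevelSuccessPredictor(nn.Module):
    """Toy predictor: fuse three feature types per image, then classify
    success from the difference between the two images (illustration only)."""

    def __init__(self, d_local=768, d_aligned=512, d_text=384, d_model=256):
        super().__init__()
        self.proj_local = nn.Linear(d_local, d_model)    # features preserving local image information
        self.proj_aligned = nn.Linear(d_aligned, d_model)  # features aligned with natural language
        self.proj_text = nn.Linear(d_text, d_model)      # features structured through natural language
        self.proj_instr = nn.Linear(d_aligned, d_model)  # instruction sentence embedding
        self.classifier = nn.Sequential(
            nn.Linear(4 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def represent(self, local, aligned, text):
        # Multi-level representation: here simply a concatenation of the projections.
        return torch.cat(
            [self.proj_local(local), self.proj_aligned(aligned), self.proj_text(text)], dim=-1
        )

    def forward(self, pre_feats, post_feats, instruction):
        diff = self.represent(*post_feats) - self.represent(*pre_feats)  # focus on what changed
        fused = torch.cat([diff, self.proj_instr(instruction)], dim=-1)
        return torch.sigmoid(self.classifier(fused))  # estimated probability of task success

# Random tensors stand in for extracted features (batch size 2).
rand = lambda d: torch.randn(2, d)
model = MultiLevelSuccessPredictor()
prob = model((rand(768), rand(512), rand(384)), (rand(768), rand(512), rand(384)), rand(512))
print(prob.shape)  # torch.Size([2, 1])
```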
We assume the following environment for our experiments:
- Python 3.8.10
- PyTorch version 2.1.0 with CUDA 11.7 support
cd contrastive-lambda-repformer
pyenv virtualenv 3.8.10 contrastive-lambda-repformer
pyenv local contrastive-lambda-repformer
pip install -r requirements.txt
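After installing the requirements, you can optionally confirm that the interpreter and PyTorch build match the versions listed above:

```python
# Optional sanity check of the assumed environment.
import sys
import torch

print(sys.version.split()[0])     # expect 3.8.10
print(torch.__version__)          # expect 2.1.0
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if a usable GPU is visible
```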
- Our dataset can be downloaded at this link.
- Unzip and extract the data using the following commands.
mkdir data
cd data
wget https://contrastive-lambda-repformer.s3.amazonaws.com/dataset/dataset.tar.gz
tar -xvzf dataset.tar.gz
cd ..
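To confirm that the archive extracted correctly, you can list the contents of data; the exact directory layout depends on the released archive, so this is only a quick check:

```python
# Quick check that the dataset extracted into data/ (layout depends on the archive).
from pathlib import Path

for path in sorted(Path("data").iterdir()):
    print(path)
```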
python src/utils/extract_ViT_features.py
python src/utils/extract_DINOv2_features.py
python src/utils/retrieve_InstructBLIP_narratives.py
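These scripts precompute the image features and scene narratives that the model consumes. Purely as an illustration of the kind of computation involved (the checkpoint name, image path, and output handling below are placeholders, not the repository's actual settings), ViT features for a single image can be obtained with Hugging Face transformers roughly as follows:

```python
# Illustrative ViT feature extraction for one image (not the repository's script).
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

image = Image.open("example.png").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, num_patches + 1, 768)
print(features.shape)
```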
- Please note that the first run of the program may take some time to complete, as the remaining features are extracted.
- Previously saved features are used in subsequent runs for faster execution.
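The caching behaviour described above can be pictured as a simple load-or-compute pattern; the function below is a hypothetical illustration, not code from the repository:

```python
# Hypothetical load-or-compute caching pattern for precomputed features.
from pathlib import Path
import torch

def load_or_extract(cache_path, extract_fn):
    cache_path = Path(cache_path)
    if cache_path.exists():
        return torch.load(cache_path)  # fast path: reuse features saved on an earlier run
    features = extract_fn()            # slow path: extract features on the first run
    torch.save(features, cache_path)
    return features
```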
export PYTHONPATH=`pwd`
export OPENAI_API_KEY="Your OpenAI API Key"
python src/main.py
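Training requires an OpenAI API key, set via OPENAI_API_KEY as shown above; we assume it is used for language-model calls during preprocessing and training (for example, to turn scene narratives into structured text). A minimal sketch of such a call, with a placeholder model name and prompt, looks like this:

```python
# Minimal sketch of an OpenAI API call; model name and prompt are placeholders.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Describe the objects on the table."}],
)
print(response.choices[0].message.content)
```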
If you already have a valid checkpoint, you can evaluate the model on the dataset.
- Set checkpoint_path in configs/config.json to the relative path of your checkpoint file (a short config-editing sketch follows the commands below).
- Execute the following commands.
export PYTHONPATH=`pwd`
export OPENAI_API_KEY="Your OpenAI API Key"
python src/test_model.py
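A small helper for the checkpoint_path setting mentioned above; the checkpoint filename here is a placeholder, and any other keys in the config are left untouched:

```python
# Point configs/config.json at a checkpoint; the filename below is a placeholder.
import json
from pathlib import Path

config_file = Path("configs/config.json")
config = json.loads(config_file.read_text())
config["checkpoint_path"] = "checkpoints/best_model.pth"  # relative path to your checkpoint
config_file.write_text(json.dumps(config, indent=4))
```

The dataset_name key used in the next step can be changed in the same way.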
The model can also be evaluated on other datasets, such as the SP-HSR dataset. We recommend using checkpoints from around epoch 50.
- Set dataset_name in configs/config.json to "SP-HSR".
- Run the feature extraction commands above ("Extract Features").
- Run the evaluation commands above ("Evaluation on SP-RT-1 Dataset").
@inproceedings{
goko2024task,
title = {{Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations}},
author = {Goko, Miyu and Kambara, Motonari and Saito, Daichi and Otsuki, Seitaro and Sugiura, Komei},
booktitle = {8th Annual Conference on Robot Learning},
year = {2024}
}
This work is licensed under the MIT license. To view a copy of this license, see LICENSE.