EgoCap is the first sizable dataset that supports end-to-end egocentric image captioning. It contains 2.1K egocentric images, over 10K captions, and 6.3K contextual labels.
The EgoCap dataset can be downloaded HERE. We also provide CLIP visual and textual features extracted with OpenAI-CLIP-Feature. If you have questions or notice any issues, please contact [email protected].
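If you prefer to extract CLIP features yourself, the snippet below is a minimal sketch using the openai `clip` package; the image path and caption string are placeholders, and the exact extraction pipeline of OpenAI-CLIP-Feature may differ.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" and the caption text are placeholders for illustration
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a person is cooking in a kitchen"]).to(device)

with torch.no_grad():
    visual_features = model.encode_image(image)   # (1, 512) for ViT-B/32
    textual_features = model.encode_text(text)    # (1, 512)
```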
EgoFormer is a two-stream transformer-based deep neural network that utilizes visual-contextual attention for image caption generation in a first-person narrative. EgoFormer achieves accurate and human-like scene understanding with the aid of context encoding. The context encoder is a pre-trained ViT encoder, subsequently fine-tuned on EgoCap context classification, namely where, when, and whom.
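The sketch below illustrates the idea of visual-contextual cross-attention in PyTorch. Dimensions, layer counts, and names are illustrative only and do not reproduce the published EgoFormer implementation.

```python
import torch
import torch.nn as nn

class VisualContextFusion(nn.Module):
    """Minimal sketch: visual tokens attend to context tokens from the ViT context encoder."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_tokens, context_tokens):
        # visual_tokens: (B, N_v, d_model), context_tokens: (B, N_c, d_model)
        fused, _ = self.cross_attn(query=visual_tokens,
                                   key=context_tokens,
                                   value=context_tokens)
        return self.norm(visual_tokens + fused)   # residual fusion

# toy usage
fusion = VisualContextFusion()
v = torch.randn(2, 196, 256)   # e.g. a 14x14 visual feature grid
c = torch.randn(2, 3, 256)     # e.g. where / when / whom context embeddings
out = fusion(v, c)             # (2, 196, 256)
```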
Please cite our paper as below:
@article{DaiEgoCap2024,
title = {EgoCap and EgoFormer: First-person image captioning with context fusion},
journal = {Pattern Recognition Letters},
volume = {181},
pages = {50-56},
year = {2024},
issn = {0167-8655},
doi = {https://doi.org/10.1016/j.patrec.2024.03.012},
url = {https://www.sciencedirect.com/science/article/pii/S0167865524000801},
author = {Zhuangzhuang Dai and Vu Tran and Andrew Markham and Niki Trigoni and M. Arif Rahman and L.N.S. Wijayasingha and John Stankovic and Chen Li},
keywords = {Image captioning, Storytelling, Dataset},
}
- Python 3.7
- PyTorch 1.7
- torchvision 0.8
- transformers
- pycocoevalcap
- aac-metrics
- sklearn
The Microsoft COCO-2017 and EgoCap datasets are required. After downloading both datasets locally, specify the dataset directories, training settings, and hyperparameters in configuration.py.
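For orientation, an illustrative excerpt of what such a configuration might look like is shown below; apart from self.modality, the field names are assumptions and should be matched to the actual class in configuration.py.

```python
# Illustrative configuration excerpt -- field names other than self.modality
# are assumptions; align them with the real configuration.py in this repository.
class Config:
    def __init__(self):
        self.modality = "image"            # "image" for the COCO baseline
        self.coco_dir = "/data/coco2017"   # local path to MS COCO-2017 (placeholder)
        self.egocap_dir = "/data/egocap"   # local path to EgoCap (placeholder)
        self.batch_size = 32
        self.lr = 1e-4
        self.epochs = 30
```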
This repository implements the training and evaluation of EgoFormer. It is adapted from the CATR repository.
Make sure you train the baseline on COCO first (set self.modality to "image"). Then use the following command for context learning:
python3 vit_pretrain.py # Pre-train ViT context encoder
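For reference, a minimal sketch of what a ViT context encoder with three classification heads might look like is given below, assuming the Hugging Face transformers package. The class counts for where/when/whom are placeholders; see vit_pretrain.py for the actual model.

```python
import torch.nn as nn
from transformers import ViTModel

class ContextViT(nn.Module):
    """Sketch of a ViT fine-tuned for EgoCap context classification (where / when / whom).
    Class counts are placeholders, not the EgoCap label sets."""
    def __init__(self, n_where=10, n_when=4, n_whom=5):
        super().__init__()
        self.backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        hidden = self.backbone.config.hidden_size
        self.where_head = nn.Linear(hidden, n_where)
        self.when_head = nn.Linear(hidden, n_when)
        self.whom_head = nn.Linear(hidden, n_whom)

    def forward(self, pixel_values):
        cls = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.where_head(cls), self.when_head(cls), self.whom_head(cls)
```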
Finally, use the following command to train EgoFormer, after updating the path to the converged context ViT checkpoint:
python3 main.py
You can run EgoFormer on an embedded device and let a robot explain the scene for you. We implemented an EgoBot on an NVIDIA Jetson Nano with a CSI camera, a speaker, a Wi-Fi dongle, and a power bank. Please check out our Video Demo.
To deploy the EgoFormer inference engine on the Jetson Nano, you will need the soundcard, audio2numpy, and gtts packages for the camera driver, inference engine, and speaker. To let EgoBot speak out loud, run
python3 egobot_shoot.py &
python3 egobot_speak.py
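The snippet below is a minimal sketch of how a generated caption could be spoken with these packages; the caption string is a placeholder, and egobot_speak.py may differ in its exact pipeline.

```python
from gtts import gTTS
from audio2numpy import open_audio
import soundcard as sc

caption = "a person is walking along a corridor"   # placeholder caption from EgoFormer
gTTS(text=caption, lang="en").save("caption.mp3")  # synthesize speech to an mp3 file

signal, rate = open_audio("caption.mp3")            # decode the mp3 into a numpy array
sc.default_speaker().play(signal, samplerate=rate)  # play through the default speaker
```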
We recommend performing evaluation through Inference_and_Analysis.ipynb. Otherwise, use predict_qualitative() in Eval.py to generate a caption for an image, or conduct quantitative analysis on a directory of images.
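A hypothetical call is shown below; the argument list is an assumption, so check Eval.py for the exact signature.

```python
from Eval import predict_qualitative

# Hypothetical usage -- "samples/kitchen.jpg" is a placeholder image path,
# and the argument list is assumed; see Eval.py for the real signature.
caption = predict_qualitative("samples/kitchen.jpg")
print(caption)
```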
Some EgoFormer captioning results, in comparison with the baseline transformer, are shown below.
This repository is available under the MIT License. Contributions are not accepted. If you notice issues, feel free to raise them or email the authors.
We thank the National Institute of Standards and Technology (NIST) for its support of the project: Pervasive, Accurate, and Reliable Location-based Services for Emergency Responders.
Experiments were run on Aston EPS Machine Learning Server, funded by the EPSRC Core Equipment Fund, Grant EP/V036106/1.
We thank Professor Bongjun Choi's team at Dongseo University for helping with data validation.