This is the code for the Crowd Scenes Captioning task, built on X-modaler.
The original paper can be found here.
See installation instructions.
- Linux or macOS with Python >= 3.6
- PyTorch and a matching version of torchvision. Install them together following the instructions at pytorch.org to ensure compatibility
- fvcore
- pytorch_transformers
- jsonlines
- pycocotools
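A minimal sketch of installing the dependencies above with pip (assuming the standard PyPI package names; PyTorch and torchvision should still be installed per the selector at pytorch.org):

```bash
# Install a matching PyTorch/torchvision pair via pytorch.org first, e.g.:
pip install torch torchvision
# Then install the remaining dependencies from the list above:
pip install fvcore pytorch_transformers jsonlines pycocotools
```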
See Getting Started with X-modaler
1 Introduction: Official introduction.
2 Features: You can download our pre-extracted features (npy files) here, including Faster R-CNN, Swin Transformer, and HRNet features. Please put them into
./open_source_dataset/crowdscenes_caption/features
3 Annotations: You can download the annotations here. Please put them into
./open_source_dataset/crowdscenes_caption
4 Evaluation: You can download the evaluation toolkit here, or use the official evaluation code. Please put it into
./cococaption
Access code: 6826
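After completing the downloads above, the directory layout should match what the scripts expect. A small sketch of the paths from the steps above (the exact annotation file names are not specified here, so they are left generic):

```bash
# Create the directories referred to in the steps above:
mkdir -p open_source_dataset/crowdscenes_caption/features
mkdir -p cococaption
# Then place:
#   - the .npy feature files (faster-rcnn / swin-transformer / hrnet)
#     under open_source_dataset/crowdscenes_caption/features
#   - the annotation files under open_source_dataset/crowdscenes_caption
#   - the evaluation code under ./cococaption
```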
Assume that you are in the root directory of this project, that you have activated your virtual environment if needed, and that the crowdcaption dataset is in 'open_source_dataset/crowdscenes_caption'. Here we use 8 GPUs.
# for XE (cross-entropy) training
bash train.sh
# for reward (RL) training
bash train_rl.sh
# for testing
bash test.sh
Training and inference for other datasets follow the same procedure with their corresponding config files.
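The provided scripts presumably wrap X-modaler's standard launcher. If you need a different GPU count or config, here is a rough sketch of the underlying command; the entry-point name `train_net.py`, its flags, and the config path are assumptions based on X-modaler's detectron2-style conventions, not taken from this repo's scripts:

```bash
# Hypothetical direct invocation (assumed X-modaler-style entry point);
# the config path below is illustrative, use the config shipped with this repo.
python train_net.py --num-gpus 8 \
    --config-file configs/image_caption/crowdcaption_xe.yaml

# Evaluation only (again assuming X-modaler's standard --eval-only flag):
python train_net.py --num-gpus 8 --eval-only \
    --config-file configs/image_caption/crowdcaption_xe.yaml
```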
Performance numbers and trained models will be released soon.
Thanks to the X-modaler team for the wonderful open-source project!
If you find this project useful in your research, please consider citing:
@article{wang2022happens,
  title={What Happens in Crowd Scenes: A New Dataset about Crowd Scenes for Image Captioning},
  author={Wang, Lanxiao and Li, Hongliang and Hu, Wenzhe and Zhang, Xiaoliang and Qiu, Heqian and Meng, Fanman and Wu, Qingbo},
  journal={IEEE Transactions on Multimedia},
  year={2022},
  publisher={IEEE}
}