- A general multimodal attention approach inspired by probabilistic graphical models.
- Achieves state-of-the-art performance (MRR) on the visual dialog task.

This repository is the official implementation of Factor Graph Attention (appeared at CVPR'19).

- Part of the winning submission to the 2020 Visual Dialog challenge (https://github.com/idansc/mrr-ndcg)
Use cases of FGA:
- Video dialog, with spatial interactions between frames: https://github.com/idansc/simple-avsd
- Spatial navigation: https://github.com/barmayo/spatial_attention
- Video retrieval, matching a text query against clips: https://github.com/AmeenAli/VideoMatch
The model can easily run on a single GPU :)
To install requirements:
```
conda env create -f fga.yml
```
followed by:
```
conda activate fga
```
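The model runs on a single GPU, so once the environment is activated you can quickly confirm that PyTorch sees it. A minimal sketch, assuming the fga environment ships with PyTorch (which train.py relies on):

```python
# Sanity check: confirm the activated environment exposes a CUDA-capable GPU.
# Assumes PyTorch is installed by fga.yml.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```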
Add the following files under data dir:
Pretrained features:
- VGG: grid image features based on a VGG model pretrained on ImageNet (faster). Note that these h5 databases have slightly different dataset keys, therefore the code needs to be adapted accordingly (see the sketch below).
- F-RCNN: features from a Faster R-CNN object detector with a ResNeXt-101 backbone, 37 proposals per image, fine-tuned on Visual Genome. Achieves SOTA. The file also includes box and class information.
Note: you can use CurlWget to easily download the features directly to your server.
See the original paper for performance differences. I recommend using the F-RCNN features, mainly because they are fine-tuned on the relevant Visual Genome dataset.
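Before adapting the loading code, it helps to list the dataset keys an h5 feature file actually exposes. A minimal sketch using h5py; the path is a placeholder for whichever feature file you downloaded:

```python
# List every dataset stored in a downloaded feature file, so the loading code
# can be pointed at the right keys. The path below is a placeholder.
import h5py

with h5py.File("data/your_features.h5", "r") as f:
    def show(name, obj):
        # Datasets expose a shape; groups do not.
        shape = getattr(obj, "shape", None)
        print(name, shape if shape is not None else "(group)")
    f.visititems(show)
```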
To train the model in the paper, run this command:
```
python train.py --batch-size 128 \
                --image_data "data/frcnn_features_new" \
                --test-batch-size 64 \
                --epochs 10 \
                --lr 1e-3 \
                --opt 0 \
                --folder-prefix "baseline" \
                --mode "FGA" \
                --initialization "he" \
                --lstm-initialization "he" \
                --log-interval 3000 \
                --test-after-every 1 \
                --word-embed-dim 200 \
                --hidden-ans-dim 512 \
                --hidden-hist-dim 128 \
                --hidden-cap-dim 128 \
                --hidden-ques-dim 512 \
                --seed 0
```
To evaluate on the val split, provide a path using the --model-pathname argument. The path should contain a model file named best_model_mrr.pth.tar.
Example call:
```
python train.py --batch-size 128 \
                --image_data "data/frcnn_features_new" \
                --test-batch-size 64 \
                --epochs 10 \
                --lr 1e-3 \
                --opt 0 \
                --only_val T \
                --model-pathname "models/baseline" \
                --folder-prefix "baseline" \
                --mode "FGA" \
                --initialization "he" \
                --lstm-initialization "he" \
                --log-interval 3000 \
                --test-after-every 1 \
                --word-embed-dim 200 \
                --hidden-ans-dim 512 \
                --hidden-hist-dim 128 \
                --hidden-cap-dim 128 \
                --hidden-ques-dim 512 \
                --seed 0
```
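If you only want to check that a downloaded checkpoint loads, and see which keys it stores, here is a minimal sketch. It assumes best_model_mrr.pth.tar is a standard PyTorch archive saved with torch.save, and the path follows the --model-pathname value used above:

```python
# Inspect the pretrained checkpoint before running evaluation.
# Assumes a regular torch.save() archive; the exact keys are not guaranteed.
import torch

checkpoint = torch.load("models/baseline/best_model_mrr.pth.tar", map_location="cpu")
if isinstance(checkpoint, dict):
    print("Checkpoint keys:", list(checkpoint.keys()))
else:
    print("Loaded object of type:", type(checkpoint))
```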
If you wish to create a test submission file (which can be submitted to the challenge servers on EvalAI), replace the only_val argument with the submission argument, i.e.:
```
python train.py --batch-size 128 \
                --image_data "data/frcnn_features_new" \
                --test-batch-size 64 \
                --epochs 10 \
                --lr 1e-3 \
                --opt 0 \
                --submission T \
                --model-pathname "models/baseline" \
                --folder-prefix "baseline" \
                --mode "FGA" \
                --initialization "he" \
                --lstm-initialization "he" \
                --log-interval 3000 \
                --test-after-every 1 \
                --word-embed-dim 200 \
                --hidden-ans-dim 512 \
                --hidden-hist-dim 128 \
                --hidden-cap-dim 128 \
                --hidden-ques-dim 512 \
                --seed 0
```
You can download pretrained models here:
- best_model_mrr.pth.tar, trained on VisDial v1.0 using F-RCNN features

Evaluation is done on VisDial v1.0.
Short description:
VisDial v1.0 contains one dialog of 10 question-answer pairs per image (each dialog starts from the image caption), over ~130k images from COCO-trainval and Flickr, totaling ~1.3 million question-answer pairs.
Our model achieves the following performance on the validation set, and similar results on test-std/test-challenge.
Model name | R@1 | MRR |
---|---|---|
FGA | 53% | 66 |
5×FGA | 56% | 69 |
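For reference, both metrics are computed from the rank the model assigns to the ground-truth answer among the 100 candidate answers for each question. A standalone sketch of the metric definitions (not this repo's evaluation code):

```python
# gt_ranks holds, for each question, the 1-based rank of the ground-truth
# answer among the 100 candidate answers.
def mean_reciprocal_rank(gt_ranks):
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

def recall_at_k(gt_ranks, k):
    return sum(r <= k for r in gt_ranks) / len(gt_ranks)

ranks = [1, 3, 2, 7, 1]              # toy example
print(mean_reciprocal_rank(ranks))   # ~0.595
print(recall_at_k(ranks, 1))         # 0.4
```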
Note: the paper results may vary slightly from the results of this repo, since it is a refactored version. For the legacy version, please contact me via email.
Please cite Factor Graph Attention if you use this work in your research:
```
@inproceedings{schwartz2019factor,
  title={Factor graph attention},
  author={Schwartz, Idan and Yu, Seunghak and Hazan, Tamir and Schwing, Alexander G},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={2039--2048},
  year={2019}
}
```