Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

This repo provides the source code & data of our paper: Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models. Checkpoints can be found here.

(Overview figure)

OHD-Caps benchmark

Our OHD-Caps dataset is designed to assess whether models can accurately distinguish correct captions from captions containing hallucinations in a comparative setting. We select 500 samples from each of the COCO, Flickr30k, and Nocaps datasets and, using the script utils/generate_negative_samples.py, generate 27 negative samples for each, covering hallucinations of various types and degrees.

The benchmark can be found in data/OHD-Caps/test. Files with "annotations" in the filename contain more detailed information, including not only the negative samples but also the objects that were inserted. Files without "annotations" are used only for format conversion in preparation for evaluation. The annotation files have the following format:

{
    "file_path": "file_path",
    "positive_sample": "caption",
    "adversarial_samples": {
        "object1": "caption1",
        ...
    },
    "popular_samples": {
        "object2": "caption2",
        ...
    },
    "random_samples": {
        "object3": "caption3",
        ...
    },
    "delete_samples": {
        "object4": "caption4",
        ...
    }
}
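
For reference, a minimal Python sketch for reading one of the annotation files is shown below. The filename is a placeholder, and the snippet assumes the file holds a list of records in the format above; adjust the loading if the released files are stored differently (e.g., as JSON lines).

import json

# Placeholder path; the released annotation files live under data/OHD-Caps/test.
path = "data/OHD-Caps/test/coco_annotations.json"

with open(path) as f:
    records = json.load(f)  # assumes a JSON list of records; adapt if stored as JSON lines

for rec in records:
    positive = rec["positive_sample"]
    # Each negative group maps an inserted (or deleted) object to its perturbed caption.
    negatives = {**rec["adversarial_samples"], **rec["popular_samples"],
                 **rec["random_samples"], **rec["delete_samples"]}
    print(rec["file_path"], positive, len(negatives))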

Finetune

We provide the code for fine-tuning the CLIP model in finetune.py. The script can be run with the following command:

cd finetune_clip
./run.sh $gpu_id

where $gpu_id is the GPU id to use. The data we use for fine-tuning consists of two parts. The first part is a dataset for compositional understanding tasks, obtained by running the utils/generate_composition_data.py script. The second part is a specially constructed hallucination-detection dataset of 16k samples, generated by the utils/generate_negative_samples.py script; specifically, we extract 8k samples each from the COCO training set and the Flickr30k dataset. All fine-tuning data are stored in the data/OHD-Caps/train directory.
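
As a rough illustration of the idea behind fine-tuning against hallucinated negatives, the sketch below scores one image against its positive caption and two hallucinated captions with a Hugging Face CLIP checkpoint and applies a cross-entropy loss that favors the positive caption. The checkpoint name, image path, and captions are placeholders, and the actual training objective and data pipeline are implemented in finetune.py.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint and inputs; see finetune.py / run.sh for the real setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
captions = [
    "a cat sleeping on a sofa",                # positive caption (index 0)
    "a cat and a dog sleeping on a sofa",      # hallucinated caption (inserted object)
    "a cat sleeping on a sofa next to a car",  # hallucinated caption (random object)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image     # shape (1, num_captions)

# Cross-entropy over the candidate captions: the positive caption (index 0)
# should receive the highest image-text similarity.
loss = torch.nn.functional.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
loss.backward()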

Evaluation

The CLIP evaluation code is adapted from https://github.com/amitakamath/whatsup_vlms. We add evaluation code for hallucination detection and extend the code to support Hugging Face models. The script can be run with the following command:

cd evaluate_clip
python main_aro.py --model-name $model_name --dataset $dataset_name

where $model_name is the model name and $dataset_name is the dataset name. The hallucination-detection datasets in the OHD-Caps benchmark are named COCO_Object, Flickr30k_Object, and Nocaps_Object.
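
At its core, the comparative evaluation checks whether the model assigns the positive caption a higher image-text similarity than every hallucinated candidate. A minimal sketch of that protocol with a Hugging Face CLIP checkpoint is shown below; the checkpoint name and inputs are placeholders, and main_aro.py handles the actual data loading and metric computation.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; pass your own model via --model-name when using main_aro.py.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prediction_is_correct(image_path, positive_caption, negative_captions):
    """Return True if the positive caption outscores all hallucinated captions."""
    captions = [positive_caption] + list(negative_captions)
    inputs = processor(text=captions, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one similarity per caption
    return logits.argmax().item() == 0  # index 0 is the positive caption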
