This repo provides the source code and data for our paper: Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models. Checkpoints can be found here.
The OHD-Caps dataset we construct is designed to assess whether models can accurately distinguish between correct captions and captions containing hallucinations in a comparative setting. We select 500 samples from each of the COCO, Flickr30k, and Nocaps datasets and generate 27 negative samples for each, covering hallucinations of various types and degrees, using the script `utils/generate_negative_samples.py`.

The benchmark can be found in `data/OHD-Caps/test`.
Files with "annotations" in the filename contain more detailed information, including not only the negative samples but also the named objects that are inserted. Files without "annotations" are used only for format conversion in preparation for evaluation.
The file format in annotation files is as follows:
```json
{
    "file_path": "file_path",
    "positive_sample": "caption",
    "adversarial_samples": {
        "object1": "caption1",
        ...
    },
    "popular_samples": {
        "object2": "caption2",
        ...
    },
    "random_samples": {
        "object3": "caption3",
        ...
    },
    "delete_samples": {
        "object4": "caption4",
        ...
    }
}
```
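As a minimal sketch of how these files can be consumed, the snippet below loads one annotation file and prints the positive caption together with the negative captions grouped by type. It assumes each annotation file holds a JSON list of records in the format above; the file name is a placeholder, not an actual file in the repo.

```python
import json

# Assumption: each annotation file is a JSON list of records in the format above.
# The file name below is a placeholder.
with open("coco_test_annotations.json") as f:
    records = json.load(f)

record = records[0]
print("image:", record["file_path"])
print("positive:", record["positive_sample"])

# Print every negative caption, grouped by the kind of perturbation.
for kind in ("adversarial_samples", "popular_samples",
             "random_samples", "delete_samples"):
    for obj, caption in record[kind].items():
        print(f"{kind} [{obj}]: {caption}")
```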
We provide the code for fine-tuning the CLIP model in `finetune.py`. The script can be run with the following command:
```bash
cd finetune_clip
./run.sh $gpu_id
```
where `$gpu_id` is the id of the GPU to use. The data we use for fine-tuning consists of two parts. The first part is a dataset for compositional understanding tasks, generated by the `utils/generate_composition_data.py` script. The second part is a specially constructed dataset for hallucination detection containing 16k samples, generated by the `utils/generate_negative_samples.py` script; specifically, we extract 8k samples each from the COCO training set and the Flickr dataset. All the data used for fine-tuning is stored in the `data/OHD-Caps/train` directory.
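To make the role of the negative samples concrete, here is a minimal sketch of a single contrastive training step that pushes an image towards its positive caption and away from generated negatives, using an off-the-shelf Hugging Face CLIP. The checkpoint name, image path, captions, and the exact loss are illustrative assumptions; `finetune.py` may organise batches and the objective differently.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative setup: any CLIP checkpoint works; this one is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

image = Image.open("coco_example.jpg")  # placeholder image path
captions = [
    "a living room with a television and a table",          # positive sample
    "a living room with a television, a table and a dog",   # adversarial negative
    "a living room with a table",                            # delete negative
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_captions)

# The positive caption sits at index 0, so a cross-entropy over the caption
# scores rewards ranking it above the hallucinated alternatives.
loss = F.cross_entropy(logits_per_image, torch.zeros(1, dtype=torch.long))
loss.backward()
optimizer.step()
```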
The CLIP evaluation code is adapted from https://github.com/amitakamath/whatsup_vlms. We add evaluation code for hallucination detection and extend the code to support Hugging Face models. The script can be run with the following command:
```bash
cd evaluate_clip
python main_aro.py --model-name $model_name --dataset $dataset_name
```
where `$model_name` is the model name and `$dataset_name` is the dataset name. The hallucination detection datasets in the OHD-Caps benchmark are named `COCO_Object`, `Flickr30k_Object`, and `Nocaps_Object`.
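To evaluate one model on all three splits in one go, a small wrapper like the following can be used. It simply shells out to `main_aro.py` with the dataset names listed above; the model name is a placeholder and the wrapper itself is not part of the repo.

```python
import subprocess

# Placeholder model name; substitute the checkpoint you want to evaluate.
model_name = "openai/clip-vit-base-patch32"

# Run the evaluation script once per OHD-Caps split.
for dataset in ("COCO_Object", "Flickr30k_Object", "Nocaps_Object"):
    subprocess.run(
        ["python", "main_aro.py", "--model-name", model_name, "--dataset", dataset],
        check=True,
    )
```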