This folder contains the implementation of InternVL-Chat.
See INSTALLATION.md for installation instructions.
In addition, using this codebase requires executing the following steps:
- Install other requirements:

pip install --upgrade pip  # enable PEP 660 support
pip install -e .
model name | type | download | size |
---|---|---|---|
InternViT-300M-448px | ViT | 🤗 HF link | 0.6 GB |
InternViT-6B-448px-V1-2 | ViT | 🤗 HF link | 11.1 GB |
InternViT-6B-448px-V1-5 | ViT | 🤗 HF link | 11.1 GB |
Nous-Hermes-2-Yi-34B | LLM | 🤗 HF link | 65.0 GB |
Please download the above model weights and place them in the `pretrained/` folder.
cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-300M-448px --local-dir InternViT-300M-448px
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-6B-448px-V1-2 --local-dir InternViT-6B-448px-V1-2
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternViT-6B-448px-V1-5 --local-dir InternViT-6B-448px-V1-5
huggingface-cli download --resume-download --local-dir-use-symlinks False NousResearch/Nous-Hermes-2-Yi-34B --local-dir Nous-Hermes-2-Yi-34B
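If you prefer Python over the CLI, the same downloads can be done with `huggingface_hub` (the library behind `huggingface-cli`). A minimal sketch, to be run from the `pretrained/` folder:

```python
# Sketch: download the four repositories above into the current directory.
from huggingface_hub import snapshot_download

for repo_id in [
    "OpenGVLab/InternViT-300M-448px",
    "OpenGVLab/InternViT-6B-448px-V1-2",
    "OpenGVLab/InternViT-6B-448px-V1-5",
    "NousResearch/Nous-Hermes-2-Yi-34B",
]:
    # local_dir mirrors the folder names used by the CLI commands above
    snapshot_download(repo_id=repo_id, local_dir=repo_id.split("/")[-1])
```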
The directory structure is:
pretrained
├── InternViT-300M-448px/
├── InternViT-6B-448px-V1-2/
├── InternViT-6B-448px-V1-5/
└── Nous-Hermes-2-Yi-34B/
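As a quick sanity check that the weights are usable, you can try loading one of the vision encoders with `transformers` (a minimal sketch; the InternViT repositories ship custom modeling code, hence `trust_remote_code=True`):

```python
# Sketch: load a downloaded vision encoder and print its parameter count.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "pretrained/InternViT-300M-448px",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```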
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon ShareGPT-4V and additionally integrate LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. Most of the data remains consistent with LLaVA-NeXT.
First, download the annotation files and place them in the `playground/opensource/` folder.
Second, download all the images we used.
- AI2D: ai2d_images (provided by InternLM-XComposer)
- ChartQA: ChartQA Dataset
- COCO: train2017
- DocVQA: train, val, test
- DVQA: images
- GQA: images
- LLaVA-Pretrain: images
- OCR-VQA: download script. We save all files as `.jpg`.
- SAM: We only use 000000~000050.tar for now. You can quickly download 9K images from here.
- TextVQA: trainvalimages
- SynthDoG-EN: We only use 00000~00004 parquet files for now, with a total of 30K images. We provide the converted images.
- VisualGenome: part1, part2
- WebData: images. Only for academic usage.
- GeoQA+: GeoQA+ images
Then, organize the data as follows in `playground/data`:
playground/
├── opensource
│ ├── ai2d_train_12k.jsonl
│ ├── chartqa_train_18k.jsonl
│ ├── docvqa_train_10k.jsonl
│ ├── dvqa_train_200k.jsonl
│ ├── geoqa+.jsonl
│ ├── llava_instruct_150k_zh.jsonl
│ ├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
│ ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
│ └── synthdog_en.jsonl
├── data
│ ├── ai2d
│ │ ├── abc_images
│ │ └── images
│ ├── chartqa
│ │ ├── test
│ │ ├── train
│ │ └── val
│ ├── coco
│ │ └── train2017
│ ├── docvqa
│ │ ├── test
│ │ ├── train
│ │ └── val
│ ├── dvqa
│ │ └── images
│ ├── gqa
│ │ └── images
│ ├── llava
│ │ └── llava_pretrain
│ │ └── images
│ ├── ocr_vqa
│ │ └── images
│ ├── sam
│ │ └── images
│ ├── share_textvqa
│ │ └── images
│ ├── synthdog-en
│ │ └── images
│ ├── textvqa
│ │ └── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ └── VG_100K_2
│ ├── web-celebrity
│ │ └── images
│ ├── web-landmark
│ │ └── images
│ ├── wikiart
│ │ └── images
│ ├── geoqa+
│ │ └── images
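Before launching training, it can save time to verify that every image referenced by the annotation files actually exists on disk. The sketch below assumes the LLaVA-style jsonl format, in which each line is a JSON object whose `image` field is a path relative to `playground/data/`; check these assumptions against the actual files.

```python
# Sketch: report images referenced by the jsonl annotations but missing on disk.
import json
from pathlib import Path

ann_dir = Path("playground/opensource")
img_root = Path("playground/data")

for ann_file in sorted(ann_dir.glob("*.jsonl")):
    missing = 0
    with open(ann_file) as f:
        for line in f:
            item = json.loads(line)
            image = item.get("image")  # assumed field name (LLaVA-style format)
            if image and not (img_root / image).exists():
                missing += 1
    print(f"{ann_file.name}: {missing} missing images")
```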
We provide Slurm scripts for multi-node, multi-GPU training. You can use either 32 or 64 GPUs to train this model. With 64 GPUs, training takes approximately 18 hours.
- If you encounter an OOM error, decrease `PER_DEVICE_BATCH_SIZE`, for example, set `PER_DEVICE_BATCH_SIZE=4`.
# using 32 GPUs
PARTITION='your partition' GPUS=32 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh
# using 64 GPUs
PARTITION='your partition' GPUS=64 PER_DEVICE_BATCH_SIZE=8 sh shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh
The hyperparameters used for fine-tuning are listed in the following table. You can also view the training logs in TensorBoard here.
Model | Trainable Params | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
---|---|---|---|---|---|---|
InternVL-Chat-V1-2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
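The global batch size of 512 is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. The accumulation factors below are assumptions used only to illustrate the arithmetic; the launch script is authoritative.

```python
# Sketch: global batch size = PER_DEVICE_BATCH_SIZE x GPUS x accumulation steps.
PER_DEVICE_BATCH_SIZE = 8
for gpus, accumulation in [(32, 2), (64, 1)]:  # assumed accumulation factors
    print(f"{gpus} GPUs: {PER_DEVICE_BATCH_SIZE} x {gpus} x {accumulation} = "
          f"{PER_DEVICE_BATCH_SIZE * gpus * accumulation}")
```

Both settings reach the global batch size of 512 listed above.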
See this document to finetune InternVL-Chat-V1-2.
OCR-related Benchmarks
Note: TextVQA has two scores, corresponding to not using and using Rosetta OCR tokens, respectively.
model | #param | DocVQA (val/test) | ChartQA (avg. test) | InfoVQA (val/test) | TextVQA (val, wo/w OCR) | OCRBench | AI2D |
---|---|---|---|---|---|---|---|
InternVL‑Chat‑V1‑1 | 19B | 47.6 / 48.1 | 59.9 | 33.3 / 32.0 | 64.2 / 68.6 | 530 | 72.4 |
InternVL‑Chat‑V1‑2 | 40B | 56.4 / 57.7 | 68.0 | 36.0 / 39.5 | 67.5 / 72.5 | 569 | 79.0 |
InternVL‑Chat‑V1‑2‑Plus | 40B | 56.9 / 56.8 | 72.8 | 40.9 / 40.6 | 71.2 / 74.1 | 598 | 78.9 |
InternVL‑Chat‑V1‑5 | 26B | 90.5 / 90.8 | 83.8 | 72.4 / 72.5 | 80.6 / - | 724 | 80.7 |
Multimodal Benchmarks
model | #param | MME | MMB (dev/test) | MMB‑CN (dev/test) | CCBench | MMVet | MMMU (val/test) | MathVista (testmini) | Hallusion Bench | RealWorld QA | SEEDv1 (image) | CMMMU (val/test) | POPE | MMVP | Tiny LVLM | LLaVA Wild |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
InternVL‑Chat‑V1‑1 | 19B | 1659.8 / 361.4 | 76.7 / 75.4 | 71.9 / 70.3 | 43.3 | 46.7 | 39.1 / 35.3 | 34.5 | 36.1 | 58.0 | 73.2 | 34.8 / 34.0 | 87.1 | 44.7 | 343.2 | 73.2 |
InternVL‑Chat‑V1‑2 | 40B | 1686.8 / 488.6 | 81.4 / 82.2 | 79.5 / 81.2 | 58.6 | 48.9 | 51.6 / 46.2 | 47.7 | 47.6 | 67.5 | 75.6 | - | 88.0 | 56.7 | 350.3 | 85.0 |
InternVL‑Chat‑V1‑2‑Plus | 40B | 1625.2 / 552.9 | 83.4 / 83.8 | 81.6 / 82.0 | 55.9 | 47.9 | 50.3 / 45.6 | 59.9 | 47.4 | 67.8 | 76.4 | - | 88.7 | 58.7 | 353.9 | 84.6 |
InternVL‑Chat‑V1‑5 | 26B | 1637.8 / 550.0 | - / 82.2 | - / 82.0 | 70.0 | 62.8 | 45.2 / - | 53.5 | 49.3 | 66.0 | 76.0 | - | 88.3 | 57.3 | 356.8 | 94.7 |
Visual Question Answering & Image Captioning
model | #param | OKVQA (val) | VizWiz (val/test) | GQA (test) | SQA (image) | VQAv2 (testdev) | COCO (test) | Flickr30K (test) | NoCaps (val) |
---|---|---|---|---|---|---|---|---|---|
InternVL‑Chat‑V1‑1 | 19B | 64.1 | 59.0 / 57.3 | 62.5 | 90.1 | 80.9 | 142.2 | 84.8 | 120.8 |
InternVL‑Chat‑V1‑2 | 40B | 62.5 | 61.9 / 60.0 | 64.0 | 83.3 | - | 113.9 | 92.9 | 112.5 |
InternVL‑Chat‑V1‑2‑Plus | 40B | 67.6 | 61.3 / 59.5 | 66.9 | 98.1 | - | 143.4 | 89.5 | 125.8 |
InternVL‑Chat‑V1‑5 | 26B | 62.0 | 63.5 / - | 65.7 | 94.0 | - | 98.4 | 81.2 | 99.6 |
Visual Grounding
model | #param | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (val) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCO‑g (val) | RefCOCO‑g (test) |
---|---|---|---|---|---|---|---|---|---|
InternVL‑Chat‑V1‑1 | 19B | 84.7 | 89.9 | 78.6 | 78.5 | 85.6 | 70.1 | 81.0 | 81.4 |
InternVL‑Chat‑V1‑2 | 40B | 74.4 | 80.3 | 66.5 | 70.7 | 77.6 | 62.0 | 69.2 | 70.0 |
InternVL‑Chat‑V1‑2‑Plus | 40B | 90.2 | 93.4 | 85.5 | 85.3 | 90.4 | 79.7 | 88.5 | 88.8 |
InternVL‑Chat‑V1‑5 | 26B | 91.4 | 93.7 | 87.1 | 87.0 | 92.3 | 80.9 | 88.5 | 89.3 |
model | QLLaMA | LLM | res | COCO | Flickr | NoCaps | VQAv2 | GQA | VizWiz | TextVQA | MME | POPE | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
InternVL‑Chat | ✔️ | frozen V‑7B | 224 | 141.4 | 89.7 | 120.5 | 72.3 | 57.7 | 44.5 | 42.1 | 1298.5 | 85.2 | TODO |
InternVL‑Chat | ✔️ | frozen V‑13B | 224 | 142.4 | 89.9 | 123.1 | 71.7 | 59.5 | 54.0 | 49.1 | 1317.2 | 85.4 | TODO |
InternVL‑Chat | ✔️ | V‑13B | 336 | 146.2 | 92.2 | 126.2 | 81.2 | 66.6 | 58.5 | 61.5 | 1586.4 | 87.6 | TODO |
COCO images are used in VQAv2/OK-VQA/RefCOCO/RefCOCO+/RefCOCOg. Make sure you have already downloaded COCO images before evaluating on these benchmarks.
Data Preparation
mkdir -p data/coco && cd data/coco
# download coco images
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip
mkdir -p annotations && cd annotations/
# download converted annotation files
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test.json
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test_gt.json
cd ../../../
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-coco [--dynamic]
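COCO captioning is conventionally scored with the `pycocoevalcap` toolkit (BLEU, METEOR, CIDEr, etc.). If you want to score a prediction file yourself, a rough sketch follows; it assumes the `coco_karpathy_test_gt.json` file uses the standard COCO annotation schema and that your predictions are a JSON list of `{"image_id", "caption"}` records, both of which you should verify.

```python
# Sketch: score caption predictions against the Karpathy-split ground truth.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco_gt = COCO("data/coco/annotations/coco_karpathy_test_gt.json")
coco_res = coco_gt.loadRes("results.json")  # hypothetical prediction file
evaluator = COCOEvalCap(coco_gt, coco_res)
evaluator.evaluate()
for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.3f}")
```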
Data Preparation
mkdir -p data/flickr30k && cd data/flickr30k
# download images from https://bryanplummer.com/Flickr30kEntities/
# karpathy split annotations can be downloaded from the following link:
# https://github.com/mehdidc/retrieval_annotations/releases/download/1.0.0/flickr30k_test_karpathy.txt
# this file is provided by the clip-benchmark repository.
# We convert this txt file to json format, download the converted file:
wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-flickr30k [--dynamic]
Data Preparation
mkdir -p data/nocaps && cd data/nocaps
# download images from https://nocaps.org/download
# original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
wget https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> caption-nocaps [--dynamic]
Data Preparation
mkdir -p data/vqav2 && cd data/vqav2
# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./
ln -s ../coco/test2015 ./
# download questions and annotations
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip && unzip v2_Annotations_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip && unzip v2_Questions_Val_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip && unzip v2_Questions_Test_mscoco.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vqav2/vqav2_testdev.jsonl
cd ../..
Evaluation
# VQAv2-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-val [--dynamic]
# VQAv2-testdev
GPUS=8 sh evaluate.sh <checkpoint> vqa-vqav2-testdev [--dynamic]
For the testdev set, submit the results to the evaluation server.
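VQAv2 (and the other VQA-style benchmarks below) use the standard VQA accuracy metric: an answer that matches at least 3 of the 10 human answers scores 1.0, otherwise #matches / 3. The sketch below is a simplified version; the official scorer additionally normalizes answers and averages over subsets of 9 annotators.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1)."""
    prediction = prediction.strip().lower()
    matches = sum(ans.strip().lower() == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> accuracy 0.67
print(vqa_accuracy("blue", ["blue", "blue"] + ["navy"] * 8))
```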
Data Preparation
mkdir -p data/okvqa && cd data/okvqa
# make sure you have downloaded COCO images
ln -s ../coco/train2014 ./
ln -s ../coco/val2014 ./
# download annotations and questions
wget https://okvqa.allenai.org/static/data/mscoco_train2014_annotations.json.zip && unzip mscoco_train2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip && unzip OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/mscoco_val2014_annotations.json.zip && unzip mscoco_val2014_annotations.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip && unzip OpenEnded_mscoco_val2014_questions.json.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/okvqa/okvqa_val.jsonl
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-okvqa-val [--dynamic]
Data Preparation
mkdir -p data/textvqa && cd data/textvqa
# download images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip && unzip train_val_images.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/textvqa/textvqa_val_questions.json
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/textvqa_val.jsonl
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/textvqa_val_llava.jsonl
cd ../..
Evaluation
# without OCR tokens
GPUS=8 sh evaluate.sh <checkpoint> vqa-textvqa-val [--dynamic]
# with OCR tokens (hint: LLaVA uses OCR tokens)
GPUS=8 sh evaluate.sh <checkpoint> vqa-textvqa-val-ocr [--dynamic]
Data Preparation
mkdir -p data/vizwiz && cd data/vizwiz
# download images
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/train.zip && unzip train.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/val.zip && unzip val.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip && unzip test.zip
# download annotations
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip && unzip Annotations.zip
# download converted files
# train
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_train.jsonl
# val
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_annotations.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val_questions.json
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_val.jsonl
# test
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/vizwiz/vizwiz_test.jsonl
cd ../..
Evaluation
# VizWiz val
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-val [--dynamic]
# VizWiz test
GPUS=8 sh evaluate.sh <checkpoint> vqa-vizwiz-test [--dynamic]
For the test set, submit the results to the evaluation server.
Data Preparation
mkdir -p data/docvqa && cd data/docvqa
# download images and annotations
wget https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz --no-check-certificate # (optional)
wget https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz --no-check-certificate
wget https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz --no-check-certificate
# extract the archives
tar -zxvf train.tar.gz
tar -zxvf val.tar.gz
tar -zxvf test.tar.gz
# download converted jsonl files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl
cd ../..
Evaluation
# DocVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-val [--dynamic]
# DocVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-docvqa-test [--dynamic]
For the test set, submit the results to the evaluation server.
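DocVQA is reported in ANLS (Average Normalized Levenshtein Similarity): a prediction scores `1 - normalized_edit_distance` against its closest reference answer, zeroed out when that similarity falls below 0.5, and the per-question scores are averaged. A minimal sketch of the per-question score:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], threshold: float = 0.5) -> float:
    """Best normalized similarity over the reference answers, zeroed below the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for ans in answers:
        ans = ans.strip().lower()
        sim = 1.0 - levenshtein(pred, ans) / max(len(pred), len(ans), 1)
        best = max(best, sim)
    return best if best >= threshold else 0.0

print(anls("form 1040", ["Form 1040", "1040"]))  # 1.0
```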
Data Preparation
mkdir -p data/chartqa && cd data/chartqa
# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl
cd ../..
Evaluation
# test both ChartQA-test-human & ChartQA-test-augmented
GPUS=8 sh evaluate.sh <checkpoint> vqa-chartqa-test [--dynamic]
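ChartQA is usually scored with relaxed accuracy: numeric answers may deviate from the ground truth by up to 5%, while other answers require an exact (case-insensitive) match. A sketch of the per-sample check:

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Relaxed accuracy: 5% tolerance for numeric answers, exact match otherwise."""
    try:
        pred = float(prediction.strip().rstrip("%"))
        tgt = float(target.strip().rstrip("%"))
        if tgt == 0:
            return pred == 0
        return abs(pred - tgt) / abs(tgt) <= tolerance
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()

print(relaxed_match("20.5", "21"))  # True (~2.4% relative error)
print(relaxed_match("Red", "red"))  # True (exact match, case-insensitive)
```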
Data Preparation
mkdir -p data/gqa && cd data/gqa
# download images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip images.zip
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/testdev_balanced.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/gqa/train_balanced.jsonl
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_gqa_testdev_balanced_qwen_format.jsonl
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-gqa-testdev [--dynamic]
Data Preparation
mkdir -p data/ocrvqa && cd data/ocrvqa
# download images by following instructions at https://ocr-vqa.github.io/kvqa_ProjectFiles/README.txt
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/ocrvqa/ocrvqa_test.jsonl
cd ../..
Evaluation
# OCRVQA-val
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-val [--dynamic]
# OCRVQA-test
GPUS=8 sh evaluate.sh <checkpoint> vqa-ocrvqa-test [--dynamic]
Data Preparation
mkdir -p data/ai2diagram && cd data/ai2diagram
# download converted files
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/ai2d_test_vlmevalkit.jsonl -O test_vlmevalkit.jsonl
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/AI2D_TEST.zip && unzip AI2D_TEST.zip
# download images from Google drive (optional, provided by InternLM-XComposer)
# https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing
# images should be placed in `data/ai2diagram/ai2d/abc_images` and `data/ai2diagram/ai2d/images`
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> vqa-ai2d-test [--dynamic]
Data Preparation
mkdir -p data/scienceqa/images && cd data/scienceqa/images
# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip
cd ..
# download original questions
wget https://github.com/lupantech/ScienceQA/blob/main/data/scienceqa/problems.json
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> scienceqa [--dynamic]
Data Preparation
mkdir -p data/refcoco && cd data/refcoco
# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco/refcoco_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testA.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcoco%2B/refcoco%2B_testB.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/refcocog/refcocog_test.jsonl
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> refcoco [--dynamic]
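Referring expression comprehension on RefCOCO/RefCOCO+/RefCOCOg is scored as accuracy at IoU ≥ 0.5 between the predicted and ground-truth boxes. A minimal sketch of that check, with boxes given as `(x1, y1, x2, y2)`:

```python
def box_iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_correct(pred_box, gt_box, threshold: float = 0.5) -> bool:
    return box_iou(pred_box, gt_box) >= threshold

print(grounding_correct((10, 10, 110, 110), (20, 20, 120, 120)))  # IoU ~0.68 -> True
```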
Data Preparation
mkdir -p data/mme && cd data/mme
# 1. Download the data following the official instructions [here](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation).
# 2. Put the downloaded images into `MME_Benchmark_release_version`.
cd ../..
Evaluation
# single GPU testing
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> mme
Data Preparation
mkdir -p data/mmbench && cd data/mmbench
# download csv files of mmbench
wget http://opencompass.openxlab.space/utils/MMBench/CCBench_legacy.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv
cd ../..
Evaluation
# mmbench_dev_20230712
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-en [--dynamic]
# mmbench_dev_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-dev-cn [--dynamic]
# mmbench_test_en_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-en [--dynamic]
# mmbench_test_cn_20231003
GPUS=8 sh evaluate.sh <checkpoint> mmbench-test-cn [--dynamic]
# ccbench_dev
GPUS=8 sh evaluate.sh <checkpoint> ccbench-dev [--dynamic]
Then, submit the results to the evaluation server.
Data Preparation
mkdir -p data/pope && cd data/pope
# make sure you have downloaded COCO images
ln -s ../coco/val2014 ./
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_pope_test.jsonl
# download `coco` from POPE
mkdir -p coco && cd coco
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_adversarial.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_popular.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_random.json
cd ../../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> pope [--dynamic]
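POPE is a binary yes/no hallucination probe, typically reported as accuracy, precision, recall, F1, and the ratio of "yes" answers. The sketch below shows the bookkeeping, assuming each model response has already been parsed to "yes" or "no":

```python
def pope_metrics(predictions: list[str], labels: list[str]) -> dict:
    """Binary yes/no metrics with 'yes' as the positive class."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(predictions, labels))
    fp = sum(p == "yes" and g == "no" for p, g in zip(predictions, labels))
    fn = sum(p == "no" and g == "yes" for p, g in zip(predictions, labels))
    tn = sum(p == "no" and g == "no" for p, g in zip(predictions, labels))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / max(len(labels), 1),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-9),
        "yes_ratio": (tp + fp) / max(len(predictions), 1),
    }

print(pope_metrics(["yes", "no", "yes"], ["yes", "no", "no"]))
```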
Data Preparation
The evaluation code will automatically download the dataset from Hugging Face.
Evaluation
# dev set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-dev [--dynamic]
# val set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-val [--dynamic]
# test set
GPUS=8 sh evaluate.sh <checkpoint> mmmu-test [--dynamic]
For the test set, submit the results to the evaluation server.
Data Preparation
mkdir -p data/tiny_lvlm && cd data/tiny_lvlm
# download dataset from https://github.com/OpenGVLab/Multi-Modality-Arena/tree/main/tiny_lvlm_evaluation
# i.e., download `updated_datasets.tar.gz` from https://drive.google.com/file/d/1PuFC612XzOmKwzRldtBb1CFZnIjiR7we/view
tar -xzvf updated_datasets.tar.gz
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> tiny_lvlm [--dynamic]
Data Preparation
cd data/
# download dataset from https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd llava-bench-in-the-wild/
rm -rf images && mkdir -p images && cd images
# download all 24 images
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/001.jpg
# ...
wget https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/024.jpg
cd ../../../
Evaluation
# single GPU testing
export OPENAI_API_KEY='your_gpt4_key'
CUDA_VISIBLE_DEVICES=0 sh evaluate.sh <checkpoint> llava-bench
Data Preparation
mkdir -p data/mm-vet && cd data/mm-vet
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
unzip mm-vet.zip
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/llava-mm-vet.jsonl
cd ../..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvet [--dynamic]
Data Preparation
cd data
git lfs install
git clone https://huggingface.co/datasets/MMVP/MMVP
git clone https://huggingface.co/datasets/MMVP/MMVP_VLM
cd ..
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> mmvp [--dynamic]
Data Preparation
mkdir -p data/MathVista && cd data/MathVista
wget https://huggingface.co/datasets/AI4Math/MathVista/raw/main/annot_testmini.json
cd ../..
Evaluation
export OPENAI_API_KEY='your-openai-key'
# testmini set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-testmini [--dynamic]
# test set
GPUS=8 sh evaluate.sh <checkpoint> mathvista-test [--dynamic]
Data Preparation
mkdir -p data/SEED && cd data/SEED
# 1. Follow the official instructions [Data Preparation for SEED-Bench-1](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md#data-preparation-for-seed-bench-1)
# to download the images and the videos. Put images under `./data/SEED/SEED-Bench-image`.
# 2. Extract the middle frame from each downloaded video and put the frames under `./data/SEED/SEED-Bench-image`.
#    The script [`extract_video_frames.py`](../internvl_chat/tools/extract_video_frames.py), modified from the official one, can be used for this (a rough sketch of the same idea is shown after this block).
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/seed.jsonl
cd ../..
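For step 2, the goal is simply to save the middle frame of each downloaded video into the image folder. The provided script is authoritative; the OpenCV-based sketch below illustrates the same idea with placeholder paths, and the exact output filenames expected by `seed.jsonl` may differ.

```python
# Sketch: write the middle frame of every .mp4 under video_dir as a .png image.
import glob
import os
import cv2

video_dir = "path/to/seed_bench_videos"   # placeholder
out_dir = "data/SEED/SEED-Bench-image"
os.makedirs(out_dir, exist_ok=True)

for path in glob.glob(os.path.join(video_dir, "*.mp4")):
    cap = cv2.VideoCapture(path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)  # seek to the middle frame
    ok, frame = cap.read()
    if ok:
        name = os.path.splitext(os.path.basename(path))[0] + ".png"
        cv2.imwrite(os.path.join(out_dir, name), frame)
    cap.release()
```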
Evaluation
GPUS=8 sh evaluate.sh <checkpoint> seed [--dynamic]