We present the All-Seeing Project V2 with:
- All-Seeing Dataset V2 (AS-V2): we propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on this unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
- All-Seeing Model v2 (ASMv2): we develop ASMv2, which integrates the Relation Conversation ability while maintaining powerful general capabilities. It is endowed with grounding and referring capabilities and exhibits state-of-the-art performance on region-level tasks. Furthermore, the model can be naturally adapted to the Scene Graph Generation task in an open-ended manner.
- Circular-based Relation Probing Evaluation (CRPE) benchmark: we construct CRPE, the first benchmark that covers all elements of the relation triplets (subject, predicate, object), providing a systematic platform for evaluating relation comprehension ability.
Figure 1: Overview and comparison of our All-Seeing Model v2 with other MLLMs.
AS-V2 comprises 127K high-quality relation conversation samples in total: 42K detailed descriptions, 63K region captions, and 22K conversations (~90K turns in total).
In the relation conversation, all mentioned objects are linked to their corresponding regions within the image while the predicates are linked to the regions corresponding to their subjects and objects. Therefore, sentences in the relation conversation can be easily parsed into a scene graph. Some data examples are shown below.
More specifically, each sentence in a relation conversation marks the objects and predicates using `<ref></ref>` and `<pred></pred>` tags, respectively.
Each marked object is followed by a bounding box, indicating its localization. Similarly, each predicate is followed by two bounding boxes, which refer to the subject and object of that predicate, respectively.
All bounding boxes are normalized to integer values within the range $[0, 1000)$, where each box is given as $[x_1, y_1, x_2, y_2]$ with $(x_1, y_1)$ and $(x_2, y_2)$ denoting its top-left and bottom-right corners.
By combining the bounding box predicted for each object (which serves as a node with its semantic tag) with the two bounding boxes predicted for the subject and object of each predicate (which define an edge labeled by that predicate), the generated ReC sentence can be naturally converted into a scene graph.
Some examples of the data format are shown below. Please refer to our paper and this script for more details about parsing the relation conversation.
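For intuition, here is a minimal parsing sketch, assuming (purely for illustration) that each `<ref>`/`<pred>` tag is immediately followed by its bounding boxes serialized as `[[x1, y1, x2, y2]]`; the script linked above is the authoritative implementation.

```python
import re

# Illustrative patterns: an object phrase followed by one box, a predicate
# followed by its subject box and object box.
REF_PATTERN = re.compile(r"<ref>(.*?)</ref>\s*(\[\[.*?\]\])")
PRED_PATTERN = re.compile(r"<pred>(.*?)</pred>\s*(\[\[.*?\]\])\s*(\[\[.*?\]\])")


def parse_rec_sentence(sentence: str):
    """Parse a ReC sentence into scene-graph nodes and edges.

    Nodes map each box string to its object phrase; edges are
    (subject_box, predicate, object_box) triplets keyed by the same box strings.
    """
    nodes = {box: phrase.strip() for phrase, box in REF_PATTERN.findall(sentence)}
    edges = [
        (subj_box, predicate.strip(), obj_box)
        for predicate, subj_box, obj_box in PRED_PATTERN.findall(sentence)
    ]
    return nodes, edges


sentence = (
    "<ref>a man</ref>[[120, 80, 360, 600]] is "
    "<pred>riding</pred>[[120, 80, 360, 600]][[200, 300, 700, 650]] "
    "<ref>a horse</ref>[[200, 300, 700, 650]]."
)
nodes, edges = parse_rec_sentence(sentence)
for subj_box, predicate, obj_box in edges:
    # Looking the edge boxes up in `nodes` recovers the triplet (man, riding, horse).
    print(nodes[subj_box], predicate, nodes[obj_box])
```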
CRPE is a benchmark designed to quantitatively evaluate the object recognition and relation comprehension ability of models. The evaluation is formulated as single-choice questions.
The benchmark consists of four splits: Existence, Subject, Predicate, and Object.
The Existence split evaluates the object recognition ability, while the remaining splits are designed to evaluate the capability of relation comprehension, focusing on probing each element of the relation triplet (subject, predicate, object) separately.
Some data examples are shown below.
To evaluate the dependency on language priors, we also include abnormal data in our evaluation. The images in these abnormal samples depict relation triplets that are very rare in the real world.
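As a rough illustration of how a circular protocol can be scored, the sketch below assumes the common CircularEval convention, in which each question is asked several times with its answer options rotated and only counts as correct when every rotated copy is answered correctly; the record fields used here are hypothetical.

```python
from collections import defaultdict


def circular_accuracy(records):
    """Aggregate per-rotation predictions into a circular accuracy score.

    `records` is an iterable of dicts with hypothetical fields:
    'question_id' (shared by all rotated copies of a question),
    'prediction', and 'answer'.
    """
    correct_per_question = defaultdict(list)
    for record in records:
        correct_per_question[record["question_id"]].append(
            record["prediction"] == record["answer"]
        )
    # A question counts as correct only if every rotated copy is answered correctly.
    num_questions = len(correct_per_question)
    return sum(all(flags) for flags in correct_per_question.values()) / max(num_questions, 1)


# Two questions, two rotations each: only question 0 is consistently correct.
records = [
    {"question_id": 0, "prediction": "A", "answer": "A"},
    {"question_id": 0, "prediction": "B", "answer": "B"},
    {"question_id": 1, "prediction": "C", "answer": "D"},
    {"question_id": 1, "prediction": "D", "answer": "D"},
]
print(circular_accuracy(records))  # 0.5
```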
Multimodal Dialogue (click to expand)
- Results on 12 general visual-language benchmarks. $^*$ The training images of the datasets are observed during training.

  | Model | VQAv2 | GQA | VizWiz | SQA-IMG | TextVQA | POPE | MME | MMB | $\text{MMB}_{CN}$ | SEED | LLaVA-Wild | MM-Vet |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | LLaVA-1.5-7B | 78.5* | 62.0* | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 |
  | LLaVA-1.5-13B | 80.0* | 63.3* | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 |
  | VILA-7B | 79.9* | 62.3* | 57.8 | 68.2 | 64.4 | 85.5 | 1533.0 | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 |
  | VILA-13B | 80.8* | 63.3* | 60.6 | 73.7 | 66.6 | 84.2 | 1570.1 | 70.3 | 64.3 | 62.8 | 73.0 | 38.8 |
  | ASMv2 (ours) | 81.0* | 63.9* | 58.1 | 87.1* | 60.2 | 86.3 | 1621.0 | 74.4 | 64.3 | 66.3 | 78.9 | 41.3 |
Multimodal Dialogue with Pointer Instructions (click to expand)
- Accuracy scores on the Referring Expression Comprehension task.

  | Model | RefCOCO Val | RefCOCO Test-A | RefCOCO Test-B | RefCOCO+ Val | RefCOCO+ Test-A | RefCOCO+ Test-B | RefCOCOg Val | RefCOCOg Test | Avg. |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 84.64 | 83.16 | 84.21 |
  | Qwen-VL-7B | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | 85.73 |
  | Ferret-13B | 89.48 | 92.41 | 84.36 | 82.81 | 88.14 | 75.17 | 85.83 | 86.34 | 85.57 |
  | ASMv2 (ours) | 90.56 | 94.24 | 86.24 | 84.81 | 90.83 | 76.89 | 87.52 | 88.26 | 87.42 |
- Results on the Region Captioning task.

  | Model | VG METEOR | VG CIDEr | RefCOCOg METEOR | RefCOCOg CIDEr |
  | --- | --- | --- | --- | --- |
  | GRiT | 17.1 | 142.0 | 15.2 | 71.6 |
  | Kosmos-2 | - | - | 14.1 | 62.3 |
  | GPT4RoI-7B | 17.4 | 145.2 | - | - |
  | GPT4RoI-13B | 17.6 | 146.8 | - | - |
  | ASM-FT | 18.3 | 148.7 | 21.8 | 107.8 |
  | ASMv2 (ours) | 17.9 | 153.5 | 21.7 | 114.7 |
- Results on Visual Commonsense Reasoning. $^*$ The single-task fine-tuning setting.

  | Method | Q $\rightarrow$ A | QA $\rightarrow$ R | Q $\rightarrow$ AR |
  | --- | --- | --- | --- |
  | ViLBERT | 72.4 | 74.5 | 54.0 |
  | Unicoder-VL | 72.6 | 74.5 | 54.5 |
  | VLBERT | 75.5 | 77.9 | 58.9 |
  | ERNIE-ViL-L | 78.5 | 83.4 | 65.8 |
  | VILLA | 78.5 | 82.6 | 65.2 |
  | $^*$GPT4RoI-7B | 87.4 | 89.6 | 78.6 |
  | ASMv2 (ours) | 87.8 | 88.8 | 78.4 |
  | $^*$ASMv2 (ours) | 88.4 | 89.9 | 79.4 |
- Accuracy scores on CRPE.

  | Model | Exist. | Subj. | Pred. | Obj. | Overall |
  | --- | --- | --- | --- | --- | --- |
  | Qwen-VL | 85.11 | 45.66 | 38.19 | 31.60 | 38.48 |
  | LLaVA-1.5 | 88.69 | 57.44 | 54.24 | 55.21 | 55.63 |
  | ASMv2 (ours) | 92.14 | 69.21 | 58.95 | 65.34 | 64.50 |
Open-ended Scene Graph Generation (click to expand)
- Recall scores on PSG.

  | Model | #Tuples | Recall | mRecall |
  | --- | --- | --- | --- |
  | IMP | 20.0 | 16.5 | 6.5 |
  | MOTIFS | 20.0 | 20.0 | 9.1 |
  | VCTree | 20.0 | 20.6 | 9.7 |
  | GPSNet | 20.0 | 17.8 | 7.0 |
  | PSGFormer | 20.0 | 18.6 | 16.7 |
  | TextPSG | 50.0 | 4.8 | - |
  | TextPSG | 100.0 | 5.5 | - |
  | ASMv2 (ours) | 9.2 | 14.2 | 10.3 |
We mainly use the LLaVA codebase to develop ASMv2. Thanks for this great work. Here we present a brief introduction. In most cases, you will only need to refer to the new documentation that we have added. Please refer to the original documentation of LLaVA-1.5 for a more detailed manual.
The training process of ASMv2 is divided into two stages, with each stage comprising a pre-training phase and an instruction-tuning phase.
In this phase, we utilize the pretraining data filtered by LLaVA for training.
# stage1 pretrain
sh scripts_asmv2/stage1-pretrain.sh
In stage 1, we utilize the same instruction-tuning data as LLaVA-1.5, while converting the region data into the ASMv2 format. The updated annotation is presented here.
You may download our finetuned checkpoint here and use it for the second-stage training.
# stage1 finetune
sh scripts_asmv2/stage1-finetune.sh
We employ 5M filtered samples from CC12M, 10M filtered samples from AS-1B, and 15M filtered samples from GRiT for pretraining.
You may download our pretrained checkpoint here and use it for the subsequent instruction tuning.
# stage2 pretrain
sh scripts_asmv2/stage2-pretrain.sh
We extend the instruction-tuning data of LLaVA-1.5 with our AS-V2 and other region-level multi-modal corpora, such as AS-Core. Please download the annotation `as_mix_4m.json` and the images from the constituent datasets:
- COCO: train2017, train2014
- GQA: images
- OCR-VQA: download script, we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVAR: images
- ShareGPT4V-100K: WebData, SAM
- CLEVR: images
- ScienceQA: images
- ST-VQA: images
- DocVQA: images
- Visual7W: images
- Flickr30K: images
- VCR: images
- SA-1B: images
After downloading all of them, organize the data as follows in `./playground/data`:
├── coco
│ ├── train2014
│ └── train2017
├── coco_flip
│ ├── train2014
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
├── vg
│ ├── VG_100K
│ └── VG_100K_2
├── LLaVAR
│ └── images
├── CLEVR_v1.0
│ └── images
├── ScienceQA
│ ├── train
├── ST-VQA
│ ├── coco-text
│ ├── ...
│ └── vizwiz
├── DocVQA
│ └── documents
├── web-celebrity
│ └── images
├── web-landmark
│ └── images
├── wikiart
│ └── images
├── Visual7W
│ └── images
├── flickr30k
│ └── images
├── VCR
│ ├── lsmdc_0001_American_Beauty
│ └── ...
├── sam
│ ├── sa_000000
│ ├── ...
│ ├── sa_000063
│ └── images
Note that `sam/sa_0000xx` contains the images used by AS-Core, while `sam/images` contains those used by ShareGPT4V-100K. Besides, images in `coco_flip` are the flipped versions of those in `coco`. We utilize flip augmentation for grounding data (i.e., RefCOCO, RefCOCO+, and RefCOCOg). To generate these flipped images, please refer to the script; a rough sketch of the idea is shown below.
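The sketch below only illustrates the idea of the flip augmentation (mirror the image horizontally and mirror each grounding box with it); the linked script remains the authoritative version, and the function name here is hypothetical.

```python
from PIL import Image


def flip_image_and_box(image_path, box, output_path):
    """Horizontally flip an image and mirror a grounding box accordingly.

    `box` is (x1, y1, x2, y2) in pixel coordinates of the original image.
    """
    image = Image.open(image_path)
    width = image.width
    image.transpose(Image.FLIP_LEFT_RIGHT).save(output_path)
    x1, y1, x2, y2 = box
    # Mirroring swaps the horizontal extremes: new_x1 = W - x2, new_x2 = W - x1.
    return (width - x2, y1, width - x1, y2)
```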
# stage2 finetune
sh scripts_asmv2/stage2-finetune.sh
NOTE: We implement an early stop at the end of the second epoch.
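If you need to reproduce this behavior yourself, a minimal sketch with a HuggingFace `TrainerCallback` might look like the following; this is an illustrative sketch, not necessarily how our scripts implement it.

```python
from transformers import TrainerCallback


class StopAfterEpochCallback(TrainerCallback):
    """Request that the Trainer stop once `max_epochs` epochs have finished."""

    def __init__(self, max_epochs: int = 2):
        self.max_epochs = max_epochs

    def on_epoch_end(self, args, state, control, **kwargs):
        # `state.epoch` is a float that reaches `max_epochs` at the end of that epoch.
        if state.epoch is not None and state.epoch >= self.max_epochs:
            control.should_training_stop = True
        return control
```

Such a callback would be passed to the `Trainer` via its `callbacks` argument.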
The method for testing the model remains the same as LLaVA-1.5; you just need to change the path of the script. Our scripts are located in `scripts_asmv2/`.
For evaluation on our proposed CRPE benchmark and other region-level tasks, please download the evaluation data and put it in `./playground/data/eval` following the structure below. Note that all box coordinates in these files refer to the image after square padding (see the sketch after the directory listing below).
├── crpe
│ ├── crpe_exist.jsonl
│ └── crpe_relation.jsonl
├── VCR
│ └── VCR_val_square_pad.jsonl
├── grounding
│ └── grounding_eval.jsonl
├── region_caption
│ ├── refcocog_val_coco_format.json
│ ├── refcocog_test_coco_format.json
│ ├── vg_test_coco_format.json
│ └── region_caption_eval_square_pad.jsonl
├── psg
│ └── psg_eval.jsonl
└── psg_pred_cls
├── psg_pred_cls_full_square_pad.jsonl
└── psg_pred_cls_options.json
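For intuition, the sketch below shows how a box in original-image coordinates maps to the square-padded coordinate system, assuming the centered pad-to-square preprocessing used by LLaVA-1.5; the evaluation files above already store boxes in the padded system, so no conversion is needed to run the scripts. The function name is hypothetical.

```python
def box_to_square_pad(box, width, height):
    """Map (x1, y1, x2, y2) from the original image onto the square-padded image.

    Assumes the image is pasted centered onto a max(width, height) square canvas.
    """
    side = max(width, height)
    dx = (side - width) // 2   # horizontal padding offset
    dy = (side - height) // 2  # vertical padding offset
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# A 640x480 image is padded to 640x640, shifting boxes down by 80 pixels.
print(box_to_square_pad((100, 100, 200, 200), 640, 480))  # (100, 180, 200, 280)
```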
- Referring Expression Comprehension
sh scripts_asmv2/eval/grounding.sh OpenGVLab/ASMv2
- Region Captioning
sh scripts_asmv2/eval/region_caption.sh OpenGVLab/ASMv2
- Referring Question Answering
sh scripts_asmv2/eval/vcr.sh OpenGVLab/ASMv2
- Open-ended Scene Graph Generation
sh scripts_asmv2/eval/psg.sh OpenGVLab/ASMv2
- Predicate Classification
sh scripts_asmv2/eval/psg_pred_cls.sh OpenGVLab/ASMv2