Official repository for "MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines".
For more details, please refer to the project page with dataset exploration and visualization tools.
[Webpage] [Paper] [Huggingface Dataset] [Leaderboard] [Visualization]
- [2024.09.30] We add an MMSearch-Engine command line demo (for any new query) here!
- [2024.09.25] MMSearch now supports evaluation in lmms-eval! Details are here.
- [2024.09.25] The evaluation code now supports directly using models implemented in VLMEvalKit!
- [2024.09.22] We release the evaluation code; you only need to add an inference API for your LMM!
- [2024.09.20] We release the arXiv paper and all MMSearch data samples in the Huggingface dataset.
- Coming soon: MMSearch-Engine demo
The capabilities of Large Multi-modal Models (LMMs) in multimodal search remain insufficiently explored and evaluated. To fill the gap left by the lack of a framework for LMMs to act as multimodal AI search engines, we first design a carefully constructed pipeline, MMSearch-Engine, that enables any LMM to function as a multimodal AI search engine.
To further evaluate the potential of LMMs in the multimodal search domain, we introduce MMSearch, an all-around multimodal search benchmark designed for assessing multimodal search performance. The benchmark contains 300 manually collected instances spanning 14 subfields, which have no overlap with current LMMs' training data, ensuring that the correct answer can only be obtained through searching.
In addition, we propose a step-wise evaluation strategy to better understand the LMMs' searching capability. The models are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task with a complete searching process. The final score is a weighted combination of the four task scores.
Outline of Evaluation Tasks, Inputs, and Outputs.
The environment is mainly for interacting with the search engine and crawling the website:
pip install -r requirements.txt
playwright install
(a). Evaluation with models implemented in VLMEvalKit
We now support directly using the models implemented in VLMEvalKit. The list of available model names is here. You need to first install VLMEvalKit with the following command, or follow the guidance in its repo:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
Then you need to set up the model in VLMEvalKit as described in step 1 of its Quickstart.
After you make sure you can run inference with the model in VLMEvalKit using the command vlmutil check {MODEL_NAME}, you can use the model by simply adding the prefix vlmevalkit_ in front of the model name in the list. For example, to use llava_onevision_qwen2_7b_ov, your input model_type should be vlmevalkit_llava_onevision_qwen2_7b_ov. We provide an example of the requery task in scripts/run_requery_vlmevalkit.sh.
Note that several models in VLMEvalKit do not support text-only inference, so they may not support the end2end task (some queries in round1 do not have image input).
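The naming rule above can be written down as a tiny helper. This is only an illustrative sketch: the function name to_model_type is hypothetical and not part of this repository; only the vlmevalkit_ prefix and the example model name come from the text above.

```python
# Prefix described above for models that come from VLMEvalKit.
VLMEVALKIT_PREFIX = "vlmevalkit_"


def to_model_type(vlmevalkit_name: str) -> str:
    """Return the model_type string for a VLMEvalKit model name.

    Hypothetical helper: it adds the prefix once and leaves names that
    already carry it untouched.
    """
    if vlmevalkit_name.startswith(VLMEVALKIT_PREFIX):
        return vlmevalkit_name
    return VLMEVALKIT_PREFIX + vlmevalkit_name


print(to_model_type("llava_onevision_qwen2_7b_ov"))
# prints "vlmevalkit_llava_onevision_qwen2_7b_ov"
```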
(b). Evaluation with custom LMMs
Here, we support evaluation of any custom LMM with very little effort. To evaluate your LMM, you only need to provide an infer function, which takes the image files and text instructions as input and outputs the model response.
We implement the code of LLaVA-OneVision in models/llava_model.py. Adding a model takes only two steps:
- Implement a class for the model. The model class must implement the infer function, which takes image files and text instructions as input. Please refer to models/llava_model.py for the illustration of input variable types.
- Add the model type in models/load.py. Then you can specify the model_type in your bash file and use your model!
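The two steps can be sketched roughly as follows. Everything here is an assumption made for illustration: the class EchoModel, the argument names image_files and text_query, and the MODEL_REGISTRY dictionary are hypothetical, and the real signature and registration logic should be taken from models/llava_model.py and models/load.py.

```python
from typing import Dict, List, Optional, Type


class EchoModel:
    """Hypothetical stand-in for a custom LMM wrapper (step 1)."""

    def __init__(self, model_path: Optional[str] = None):
        # A real wrapper would load model weights and a processor here.
        self.model_path = model_path

    def infer(self, image_files: List[str], text_query: str) -> str:
        # Assumed contract: image paths plus a text instruction go in,
        # the model response comes out as a string. The list may be empty,
        # because some end2end queries have no image input.
        return f"[{len(image_files)} image(s)] {text_query}"


# Step 2, sketched as a plain dictionary from model_type to class.
# The actual mechanism in models/load.py may look different.
MODEL_REGISTRY: Dict[str, Type[EchoModel]] = {"echo": EchoModel}


def load_model(model_type: str, **kwargs) -> EchoModel:
    """Look up the model class by model_type and instantiate it."""
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](**kwargs)


model = load_model("echo")
print(model.infer(["demo/demo.png"], "When is the US release date for this movie?"))
```

Handling an empty image list inside infer is worth checking early, since text-only queries appear in the end2end task.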
Note that there are four tasks for computing the final score of MMSearch: end2end, requery, rerank, and summarization.
The requery task is automatically evaluated when conducting the end2end task. Therefore, to evaluate all the tasks in MMSearch, you only need to run the end2end, rerank, and summarization tasks. The evaluation commands are as follows:
# end2end task
bash scripts/run_end2end.sh
# rerank task
bash scripts/run_rerank.sh
# summarization task
bash scripts/run_summarization.sh
After the three scripts complete, run the following code to get the final score:
bash scripts/run_get_final_score.sh
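For intuition about how four task scores can be folded into one number, here is a minimal sketch of a weighted combination. It is not the logic of scripts/run_get_final_score.sh: the function name, the normalization by the total weight, the equal weights, and the example scores are all made up for illustration. Please refer to the paper for the actual weights.

```python
from typing import Mapping

TASKS = ("end2end", "requery", "rerank", "summarization")


def weighted_final_score(scores: Mapping[str, float],
                         weights: Mapping[str, float]) -> float:
    """Combine per-task scores into a single weighted average.

    Assumption: weights are non-negative and the result is normalized by
    their sum, so they do not have to add up to 1.
    """
    missing = [t for t in TASKS if t not in scores or t not in weights]
    if missing:
        raise KeyError(f"Missing score or weight for: {missing}")
    total_weight = sum(weights[t] for t in TASKS)
    if total_weight <= 0:
        raise ValueError("The weights must sum to a positive value")
    return sum(scores[t] * weights[t] for t in TASKS) / total_weight


# Made-up scores and equal weights, used only to show the arithmetic.
example_scores = {"end2end": 60.0, "requery": 40.0,
                  "rerank": 80.0, "summarization": 20.0}
equal_weights = {task: 1.0 for task in TASKS}
print(weighted_final_score(example_scores, equal_weights))  # prints 50.0
```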
Here are some important notes:
- How to set the parameters?
  - We provide example input arguments in the bash files mentioned above.
  - The end2end task needs to interact with the Internet and the search engine. Please adjust the website loading timeout in constants.py according to your network status.
- Evaluation time and multi-GPU inference
  Typically, the end2end task takes the longest since it conducts three rounds sequentially and needs to interact with the Internet. We provide a very basic mechanism for inference with multiple GPUs; see the example in scripts/run_rerank_parallel.sh. However, we do not recommend running the end2end task with too many GPUs, since it will hit the rate limit of the search engine API, which will then refuse to respond. Normally, running the end2end task takes 3-5 hours on a single GPU.
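One common way to split a benchmark across several GPU workers is to give each worker a disjoint, interleaved slice of the sample indices. The sketch below shows that idea only; the function shard_indices is hypothetical, and the mechanism actually used in scripts/run_rerank_parallel.sh may differ.

```python
from typing import List


def shard_indices(num_samples: int, num_shards: int, shard_id: int) -> List[int]:
    """Return the sample indices handled by one worker.

    Worker shard_id takes every num_shards-th sample starting at its own
    offset, so the shards are disjoint and together cover all samples.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return list(range(shard_id, num_samples, num_shards))


# Example: the 300 benchmark queries split across 4 workers.
for worker in range(4):
    print(worker, len(shard_indices(300, 4, worker)))
```

Keeping the number of shards small for the end2end task also limits how many workers query the search engine at the same time.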
Evaluation with lmms-eval
You also need to set up the environment specified above. Then you can simply run the evaluation with lmms-eval commands. Note that lmms-eval currently only supports evaluating MMSearch with LLaVA-OneVision. More models will be supported very soon!
We provide a command line demo of MMSearch-Engine for any new queries.
We provide query examples in demo/query_cli.json. For queries with an image, you need to specify the path to the image as query_image and a URL of the image as query_image_url, since Google Lens only supports URL input here. An easy way to get a URL for an image is to upload it to any public GitHub repository, then replace blob with raw in the image URL:
{
"query": "When is the US release date for this movie?",
"query_image": "demo/demo.png",
"query_image_url": "https://github.com/CaraJ7/MMSearch/raw/main/demo/demo.png"
}
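The blob-to-raw substitution can also be done programmatically. This is a small sketch with a hypothetical helper name, github_blob_to_raw; it only rewrites the URL string and does not check that the file exists or that the repository is public.

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a GitHub file page URL (.../blob/...) into a raw-file URL.

    Only the first "/blob/" segment is replaced, so a file path that
    happens to contain "blob" further down stays unchanged.
    """
    marker = "/blob/"
    if marker not in url:
        raise ValueError(f"Not a GitHub blob URL: {url}")
    return url.replace(marker, "/raw/", 1)


print(github_blob_to_raw("https://github.com/CaraJ7/MMSearch/blob/main/demo/demo.png"))
# prints "https://github.com/CaraJ7/MMSearch/raw/main/demo/demo.png"
```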
For queries without an image, you only need to specify the query and set query_image to null:
{
"query": "When is the US release date for Venom: The Last Dance?",
"query_image": null
}
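The two query shapes shown above can be generated with a short helper, which avoids hand-editing JSON. The function make_query_entry is hypothetical, and how entries are arranged inside demo/query_cli.json (for example, as a list) is not shown here, so please check that file before writing to it.

```python
import json
from typing import Any, Dict, Optional


def make_query_entry(query: str,
                     query_image: Optional[str] = None,
                     query_image_url: Optional[str] = None) -> Dict[str, Any]:
    """Build one query entry in the same shape as the examples above."""
    if query_image is None:
        # Text-only query: query_image is serialized as JSON null.
        return {"query": query, "query_image": None}
    if not query_image_url:
        raise ValueError("Image queries also need query_image_url")
    return {
        "query": query,
        "query_image": query_image,
        "query_image_url": query_image_url,
    }


entry = make_query_entry("When is the US release date for Venom: The Last Dance?")
print(json.dumps(entry, indent=2))
```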
To successfully search the image in Google Lens, make sure the search engine that playwright opens is in English. Otherwise, it will throw an error. To get the search result of your queries, simply run the following command. The parameters have the same meaning as the parameters in the end2end evaluation task script.
bash demo/run_demo_cli.sh
The Leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!
To contribute your model to the leaderboard, please email the prediction files of the four tasks to [email protected].
We release the MMSearch data for benchmarking on the leaderboard, which contains 300 queries and the intermediate results for step-wise evaluation.
You can download the dataset from Huggingface with the following code (make sure that you have installed the related packages):
from datasets import load_dataset
dataset = load_dataset("CaraJ/MMSearch")
If you find MMSearch useful for your research and applications, please kindly cite using this BibTeX:
@article{jiang2024mmsearch,
title={MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines},
author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Wu, Yanmin and Lei, Jiayi and Qiu, Pengshuo and Lu, Pan and Chen, Zehui and Song, Guanglu and Gao, Peng and others},
journal={arXiv preprint arXiv:2409.12959},
year={2024}
}
Explore our additional research on Vision-Language Large Models:
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [MathVista] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [LLaMA-Adapter V2] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- [ImageBind-LLM] Imagebind-LLM: Multi-modality Instruction Tuning
- [SPHINX] The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal LLMs
- [SPHINX-X] Scaling Data and Parameters for a Family of Multi-modal Large Language Models
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [PerSAM] Personalize segment anything model with one shot
- [CoMat] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching