
The First Multimodal Search Engine Pipeline and Benchmark for LMMs


CaraJ7/MMSearch


MMSearch 🔥🔍: Unveiling the Potential of Large Models as Multi-modal Search Engines

MultimodalSearch Multimodal AI Search Engine Multi-Modal

GPT-4o GPT-4V Claude-3.5

Official repository for "MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines".

🌟 For more details, please refer to the project page with dataset exploration and visualization tools.

[🌐 Webpage] [πŸ“– Paper] [πŸ€— Huggingface Dataset] [πŸ† Leaderboard] [πŸ” Visualization]

💥 News

  • [2024.09.30] 🌏 We added a command-line demo of MMSearch-Engine (for any new query) here!
  • [2024.09.25] 🌟 MMSearch now supports evaluation in lmms-eval! Details are here.
  • [2024.09.25] 🌟 The evaluation code now supports directly using models implemented in VLMEvalKit!
  • [2024.09.22] 🔥 We released the evaluation code; you only need to add an inference API for your LMM!
  • [2024.09.20] 🚀 We released the arXiv paper and all MMSearch data samples in the Hugging Face dataset.

📌 ToDo

  • Coming soon: MMSearch-Engine demo

👀 About MMSearch

The capabilities of Large Multi-modal Models (LMMs) in multimodal search remain insufficiently explored and evaluated. To fill the gap left by the lack of a framework for LMMs to act as multimodal AI search engines, we first design a carefully constructed pipeline, MMSearch-Engine, which enables any LMM to function as a multimodal AI search engine.


To further evaluate the potential of LMMs in the multimodal search domain, we introduce MMSearch, an all-around benchmark for assessing multimodal search performance. The benchmark contains 300 manually collected instances spanning 14 subfields. These instances have no overlap with the training data of current LMMs, so the correct answer can only be obtained by searching.


An overview of MMSearch.

In addition, we propose a step-wise evaluation strategy to better understand the LMMs' searching capability. The models are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task with a complete searching process. The final score is a weighted combination of the scores on the four tasks.
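As a rough illustration of how four task scores can be combined into one number, here is a minimal sketch of a weighted average. The function name, the equal weights, and the example scores are all placeholders chosen for illustration; they are not the benchmark's actual weights or results, which are defined by the paper and the final-score script.

```python
from typing import Dict

TASKS = ("end2end", "requery", "rerank", "summarization")


def weighted_final_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-task scores.

    Hypothetical helper: the weights are passed in by the caller because
    the official values are not stated in this README.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for tasks: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[task] * weights[task] for task in weights) / total_weight


# Placeholder weights (equal for every task), NOT the official ones.
placeholder_weights = {task: 1.0 for task in TASKS}
# Dummy scores used only to exercise the function.
dummy_scores = {"end2end": 60.0, "requery": 40.0, "rerank": 80.0, "summarization": 20.0}

final_score = weighted_final_score(dummy_scores, placeholder_weights)
print(final_score)
```

Passing the weights explicitly keeps the sketch honest: swapping in the real values only requires changing the dictionary.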


Outline of Evaluation Tasks, Inputs, and Outputs.

🔍 An example of LMM input, output, and ground truth for four evaluation tasks


Evaluation

📈 Evaluation by yourself

Setup Environment

The environment is mainly for interacting with the search engine and crawling websites:

pip install -r requirements.txt
playwright install

Get your LMMs ready

(a). ✨ Evaluation with models implemented in VLMEvalKit

We now support directly using the models implemented in VLMEvalKit. The list of available model names is here. You first need to install VLMEvalKit with the following commands, or follow the guidance in its repo:

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

Then set up the model in VLMEvalKit as described in step 1 of its Quickstart.

After confirming that you can run inference with the model in VLMEvalKit via the command vlmutil check {MODEL_NAME}, you can use the model by simply adding the prefix vlmevalkit_ to the model name in the list. For example, to use llava_onevision_qwen2_7b_ov, set model_type to vlmevalkit_llava_onevision_qwen2_7b_ov. We provide an example for the requery task in scripts/run_requery_vlmevalkit.sh.
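The naming rule is simple enough to capture in a few lines. The helpers below are a hypothetical sketch (they are not part of the repository) that only show how a VLMEvalKit model name maps to the model_type string and back.

```python
VLMEVALKIT_PREFIX = "vlmevalkit_"


def to_model_type(vlmevalkit_name: str) -> str:
    """Build the model_type value for a model taken from the VLMEvalKit list."""
    return VLMEVALKIT_PREFIX + vlmevalkit_name


def is_vlmevalkit_model(model_type: str) -> bool:
    """Tell whether a model_type refers to a VLMEvalKit-backed model."""
    return model_type.startswith(VLMEVALKIT_PREFIX)


def to_vlmevalkit_name(model_type: str) -> str:
    """Recover the original VLMEvalKit name from a prefixed model_type."""
    if not is_vlmevalkit_model(model_type):
        raise ValueError(f"{model_type!r} does not start with {VLMEVALKIT_PREFIX!r}")
    return model_type[len(VLMEVALKIT_PREFIX):]


model_type = to_model_type("llava_onevision_qwen2_7b_ov")
print(model_type)
```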

Note that several models in VLMEvalKit do not support text-only inference, so they may not support the end2end task (some queries in round 1 have no image input).

(b). 💪 Evaluation with custom LMMs

Evaluating any custom LMM takes very little effort. You only need to provide an infer function, which takes the image files and text instructions as input and outputs the model response.

We implement LLaVA-OneVision in models/llava_model.py as a reference. Adding a model takes only two steps:

  1. Implement a class for the model. The model class must implement the infer function, which takes image files and text instructions as input. Please refer to models/llava_model.py for the input variable types.
  2. Add the model type in models/load.py. You can then specify the model_type in your bash file and use your model!
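The two steps above can be sketched as follows. This is only a toy outline under stated assumptions: the class name EchoModel, the argument names image_files and text_query, the registry, and the load_model helper are all hypothetical. The real argument types are the ones shown in models/llava_model.py, and the real registration happens in models/load.py.

```python
from typing import Callable, Dict, List, Optional


class EchoModel:
    """Toy stand-in for an LMM wrapper (hypothetical, not in the repository)."""

    def __init__(self, model_path: str = "dummy-checkpoint"):
        # A real wrapper would load weights and a processor here.
        self.model_path = model_path

    def infer(self, image_files: Optional[List[str]], text_query: str) -> str:
        """Take image files plus a text instruction and return the response.

        Handles the text-only case (no images) as well, since some queries
        come without an image.
        """
        num_images = len(image_files) if image_files else 0
        # A real wrapper would run the model; this one just echoes its input.
        return f"[{num_images} image(s)] {text_query}"


# Step 2, sketched as a simple name-to-constructor registry (an assumption
# about how a loader could look, not a copy of models/load.py).
MODEL_REGISTRY: Dict[str, Callable[[], EchoModel]] = {"echo": EchoModel}


def load_model(model_type: str) -> EchoModel:
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"unknown model_type: {model_type}")
    return MODEL_REGISTRY[model_type]()


model = load_model("echo")
response = model.infer(["demo/demo.png"], "Describe the image.")
print(response)
```

Keeping infer as the single entry point means the evaluation scripts never need to know how a particular model builds its prompt or loads its images.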

Begin evaluation!

Note that there are four tasks for computing the final score of MMSearch: end2end, requery, rerank, and summarization.

The requery task is automatically evaluated when conducting the end2end task. Therefore, to evaluate all the tasks in MMSearch, you only need to run the end2end, rerank, and summarization tasks. The evaluation commands are as follows:

# end2end task
bash scripts/run_end2end.sh
# rerank task
bash scripts/run_rerank.sh
# summarization task
bash scripts/run_summarization.sh

After the three scripts complete, run the following command to get the final score:

bash scripts/run_get_final_score.sh
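If you prefer to chain the steps from Python, the sketch below runs the three task scripts in order and then the final-score script, stopping at the first failure. The function names and the injectable runner argument are assumptions made for illustration; only the script paths come from this README.

```python
import subprocess
from typing import Callable, List

TASK_SCRIPTS = [
    "scripts/run_end2end.sh",
    "scripts/run_rerank.sh",
    "scripts/run_summarization.sh",
]
FINAL_SCORE_SCRIPT = "scripts/run_get_final_score.sh"


def run_script(path: str) -> int:
    """Run one bash script from the repository root and return its exit code."""
    return subprocess.run(["bash", path]).returncode


def run_full_evaluation(runner: Callable[[str], int] = run_script) -> List[str]:
    """Run the task scripts, then the final-score script, in that order.

    The runner is injectable so the ordering logic can be checked without
    actually launching the evaluation.
    """
    executed = []
    for script in TASK_SCRIPTS + [FINAL_SCORE_SCRIPT]:
        executed.append(script)
        if runner(script) != 0:
            raise RuntimeError(f"{script} failed; later steps were not run")
    return executed


# Dry run with a fake runner that always succeeds.
planned = run_full_evaluation(runner=lambda path: 0)
print(planned)
```

Calling run_full_evaluation() with no arguments would launch the real scripts, so do that only from the repository root with the environment set up.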

Here are some important notes:

  1. How to set the parameters?

    • We provide example input arguments in the bash files mentioned above.
    • The end2end task needs to interact with the Internet and the search engine. Please adjust the website-loading timeout in constants.py according to your network conditions.
  2. Evaluation time and multi-GPU inference

    Typically, the end2end task takes the longest, since it runs three rounds sequentially and needs to interact with the Internet. We provide a very basic mechanism for multi-GPU inference, with an example in scripts/run_rerank_parallel.sh. However, we do not recommend running the end2end task with too many GPUs, since doing so will hit the rate limit of the search engine API, which will then refuse to respond. Normally, running the end2end task takes 3-5 hours on a single GPU.

Evaluation with lmms-eval

You also need to set up the environment specified above. Then you can simply run the evaluation with lmms-eval commands. Note that lmms-eval currently only supports evaluating MMSearch with LLaVA-OneVision. More models will be supported very soon!

Demo

We provide a command line demo of MMSearch-Engine for any new queries.

Prepare query

We provide query examples in demo/query_cli.json. For queries with an image, you need to specify both the path to the query_image and a URL of the query_image, since Google Lens only supports URL input here. An easy way to get a URL for an image is to upload it to any public GitHub repository and then replace blob with raw in the image URL:

{
    "query": "When is the US release date for this movie?",
    "query_image": "demo/demo.png",
    "query_image_url": "https://github.com/CaraJ7/MMSearch/raw/main/demo/demo.png"
}

For queries without an image, you only need to specify the query and set query_image to null:

{
    "query": "When is the US release date for Venom: The Last Dance?",
    "query_image": null
}
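Both query shapes, plus the blob-to-raw substitution, can be produced with a small helper. The function names below are hypothetical and the sketch only builds and prints the entries; check demo/query_cli.json for how the entries are arranged in the file before writing your own.

```python
import json
from typing import Any, Dict, Optional


def github_blob_to_raw(url: str) -> str:
    """Turn a GitHub 'blob' page URL into a 'raw' file URL."""
    return url.replace("/blob/", "/raw/", 1)


def make_query(
    query: str,
    query_image: Optional[str] = None,
    query_image_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one query entry with the keys shown in the examples above."""
    if query_image is None:
        # Text-only query: query_image is serialized as null.
        return {"query": query, "query_image": None}
    if not query_image_url:
        raise ValueError("queries with an image also need query_image_url")
    return {
        "query": query,
        "query_image": query_image,
        "query_image_url": query_image_url,
    }


raw_url = github_blob_to_raw(
    "https://github.com/CaraJ7/MMSearch/blob/main/demo/demo.png"
)
image_query = make_query(
    "When is the US release date for this movie?", "demo/demo.png", raw_url
)
text_query = make_query("When is the US release date for Venom: The Last Dance?")

print(json.dumps(image_query, indent=4))
print(json.dumps(text_query, indent=4))
```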

Get the search result

To search the image in Google Lens successfully, make sure the search engine page that Playwright opens is in English; otherwise, it will throw an error. To get the search results for your queries, simply run the following command. The parameters have the same meaning as those in the end2end evaluation task script.

bash demo/run_demo_cli.sh

🏆 Leaderboard

Contributing to the Leaderboard

🚨 The Leaderboard is continuously being updated, welcoming the contribution of your excellent LMMs!

To contribute your model to the leaderboard, please email the prediction files of four tasks to πŸ“«[email protected].

Data Usage

We release the MMSearch data for benchmarking on the leaderboard. It contains 300 queries and the intermediate results for step-wise evaluation.

You can download the dataset from 🤗 Hugging Face with the following code (make sure that you have installed the datasets package):

from datasets import load_dataset

dataset = load_dataset("CaraJ/MMSearch")

✅ Citation

If you find MMSearch useful for your research and applications, please kindly cite using this BibTeX:

@article{jiang2024mmsearch,
  title={MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines},
  author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Wu, Yanmin and Lei, Jiayi and Qiu, Pengshuo and Lu, Pan and Chen, Zehui and Song, Guanglu and Gao, Peng and others},
  journal={arXiv preprint arXiv:2409.12959},
  year={2024}
}

🧠 Related Work

Explore our additional research on Vision-Language Large Models:
