This repository contains the implementation of the following work:
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Fan Zhang∗, Shulin Tian∗, Ziqi Huang∗, Yu Qiao+, Ziwei Liu+
Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.
Overview of Evaluation Agent Framework. This framework leverages LLM-powered agents for efficient and flexible visual model assessments. As shown, it consists of two stages: (a) the Proposal Stage, where user queries are decomposed into sub-aspects, and prompts are generated, and (b) the Execution Stage, where visual content is generated and evaluated using an Evaluation Toolkit. The two stages interact iteratively to dynamically assess models based on user queries.
- Clone the repository.
git clone https://github.com/Vchitect/Evaluation-Agent.git
cd Evaluation-Agent
- Install the environment.
conda create -n eval_agent python=3.10
conda activate eval_agent
pip install -r requirements.txt
First, you need to configure the open_api_key
. You can do it as follows:
export OPENAI_API_KEY="your_api_key_here"
python open_ended_eval.py --user_query $USER_QUERY --model $MODEL
$USER_QUERY
can be any question regarding the model’s capabilities, such as ‘How well does the model generate trees in anime style?’$MODEL
refers to the image generation model you want to evaluate. Currently, we support four models: SD-14, SD-21, SDXL-1, and SD-3. You can integrate new models in the following path:./eval_agent/eval_models/
- Configure the VBench Environment
- You need to configure the VBench environment on top of the existing environment. For details, refer to VBench.
- Prepare the Model to be Evaluated
-
Download the weights of the target model for evaluation and place them in
./eval_agent/eval_models/{model_name}/checkpoints/
. -
Currently, we support four models: latte, modelscope, videocrafter-0.9, and videocrafter-2. These models may also have specific environment requirements. For details, please refer to the respective model links.
python eval_agent_for_vbench.py --user_query $USER_QUERY --model $MODEL
$USER_QUERY
need to be related to the 15 dimensions of VBench. These dimensions are:subject_consistency
,background_consistency
,motion_smoothness
,dynamic_degree
,aesthetic_quality
,imaging_quality
,object_class
,multiple_objects
,human_action
,color
,spatial_relationship
,scene
,temporal_style
,appearance_style
, andoverall_consistency
.$MODEL
refers to the video generation model you want to evaluate.
- Configure the T2I-CompBench Environment
- You need to configure the T2I-CompBench environment on top of the existing environment. For details, refer to T2I-CompBench.
- Prepare the Model to be Evaluated
python eval_agent_for_t2i_compbench.py --user_query $USER_QUERY --model $MODEL
$USER_QUERY
need to be related to the 4 dimensions of T2I-CompBench. These dimensions are:color_binding
,shape_binding
,texture_binding
,non-spatial relationship
.$MODEL
refers to the image generation model you want to evaluate.
We propose the Open-Ended User Query Dataset, developed through a user study. As part of this process, we gathered questions from various sources, focusing on aspects users consider most important when evaluating new models. After cleaning, filtering, and expanding the initial set, we compiled a refined dataset of 100 open-ended user queries.
Check out the details of the open-ended user query dataset
The three graphs give an overview of the distributions and types of our curated open queries set. Left: the distribution of question types, which are categorized as General
or Specific
. Middle: the distribution of the ability types, which are categorized as Prompt Following
, Visual Quality
, Creativity
, Knowledge
and Others
. Right: the distribution of the content categories, which are categorized as History and Culture
, Film and Entertainment
, Science and Education
, Fashion
, Medical
, Game Design
, Architecture and Interior Design
, Law
.
If you find our repo useful for your research, please consider citing our paper:
@article{zhang2024evaluationagent,
title = {Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models},
author = {Zhang, Fan and Tian, Shulin and Huang, Ziqi and Qiao, Yu and Liu, Ziwei},
journal={arXiv preprint arXiv:2412.09645},
year = {2024}
}