Refine metrics and scripts & format
lixiang007666 committed Jul 26, 2024
1 parent 62a1f3f commit 47a9a7f
Showing 52 changed files with 3,949 additions and 3,941 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -156,3 +156,5 @@ src/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

*.log
24 changes: 24 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,24 @@
exclude: '((generator.py)|(generated/.*))$'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-docstring-first
      - id: check-toml
      - id: check-yaml
        exclude: packaging/.*
        args:
          - --allow-multiple-documents
      - id: mixed-line-ending
        args: [--fix=lf]
      - id: end-of-file-fixer

  - repo: https://github.com/omnilib/ufmt
    rev: v1.3.3
    hooks:
      - id: ufmt
        additional_dependencies:
          - black == 22.3.0
          - usort == 1.0.2
220 changes: 58 additions & 162 deletions README.md
@@ -4,207 +4,103 @@
<img src="imgs/onediff_logo.png" height="100">
</p>

1. [Installation](#installation) 🛠️
   - [Prepare the OneDiff Environment](#1-prepare-the-onediff-environment)
   - [Prepare the Benchmark Environment](#2-prepare-the-benchmark-environment)
2. [Models](#models)
   - [Introduction](#introduction)
   - [SDXL](#sdxl)
   - [SD 1.5](#sd-15)
   - [SVD](#svd)
3. [Quick Start](#quick-start)
   - [Generate Benchmark Images](#1-generate-benchmark-images)
   - [Test Using Multiple Indicators](#2-test-using-multiple-indicators-with-scripts)
4. [References](#references) 📚
5. [Citing](#citing) 📖

This repository evaluates the quality of images generated after compilation acceleration with [OneDiff](https://github.com/siliconflow/onediff).


## Installation

1. **Prepare the OneDiff environment.**

   Follow the instructions to install OneDiff and other dependencies:
   - [Community Edition (CE)](https://github.com/siliconflow/onediff/tree/main?tab=readme-ov-file#installation)
   - [Enterprise Edition (EE)](https://github.com/siliconflow/onediff/blob/main/README_ENTERPRISE.md#diffusers-with-onediff-enterprise)

2. **Prepare the benchmark environment.**

   The dataset used for quality benchmarking is the [Human Preference Dataset v2 (HPD v2)](https://huggingface.co/datasets/zhwang/HPDv2), which is downloaded automatically the first time any script is executed.

   <p align="center"><img src="imgs/overview.png" width="70%"><br></p>

   Install the benchmark library:
   ```
   pip3 install -r requirements.txt
   pip3 install -e .
   ```
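Once the environment is ready, the HPD v2 benchmark prompts can also be iterated programmatically. A minimal sketch, assuming the `hpsv2` package exposes a `benchmark_prompts` helper that fetches the prompt files on first use; `generate` is a hypothetical placeholder for any text-to-image callable:

```python
import os

import hpsv2  # assumption: downloads the HPD v2 prompt files on first use


def generate(prompt: str):
    """Placeholder for a real text-to-image pipeline returning a PIL image."""
    raise NotImplementedError


# style -> list of prompts, e.g. "anime", "concept-art", "paintings", "photo"
all_prompts = hpsv2.benchmark_prompts("all")

for style, prompts in all_prompts.items():
    os.makedirs(os.path.join("outputs", style), exist_ok=True)
    for idx, prompt in enumerate(prompts):
        image = generate(prompt)
        image.save(os.path.join("outputs", style, f"{idx:05d}.jpg"))
```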
## Models
### Introduction
The quality of SDXL, SD 1.5, and SVD after OneDiff acceleration has been benchmarked so far. For explanations of the metrics used, see: https://github.com/siliconflow/odeval/wiki/Datasets-and-evaluation-metrics-used-for-quality-benchmarking.
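For intuition, the pixel-level metrics (SSIM, MSE, MAE) reduce to straightforward array comparisons between paired images. A minimal sketch, assuming two folders of identically named images; the folder names are illustrative, not the repo's API:

```python
import os

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity


def compare_folders(ref_dir: str, test_dir: str):
    """Average SSIM/MSE/MAE over images paired by filename."""
    ssim_vals, mse_vals, mae_vals = [], [], []
    for name in sorted(os.listdir(ref_dir)):
        ref = np.asarray(Image.open(os.path.join(ref_dir, name)).convert("RGB"), dtype=np.float64)
        test = np.asarray(Image.open(os.path.join(test_dir, name)).convert("RGB"), dtype=np.float64)
        ssim_vals.append(structural_similarity(ref, test, channel_axis=-1, data_range=255))
        mse_vals.append(np.mean((ref - test) ** 2))
        mae_vals.append(np.mean(np.abs(ref - test)))
    return float(np.mean(ssim_vals)), float(np.mean(mse_vals)), float(np.mean(mae_vals))


ssim, mse, mae = compare_folders("sdxl_torch", "sdxl_onediff")  # illustrative paths
print(f"SSIM={ssim:.4f}  MSE={mse:.3f}  MAE={mae:.3f}")
```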
### SDXL
Run:
```
bash run_sdxl_tests.sh
```
**HPS v2** comparison results:

| Optimization Technique | Paintings | Photo | Concept-Art | Anime | Average Score |
|------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 26.58 ± 0.4468 | 24.31 ± 0.4500 | 26.55 ± 0.2888 | 28.81 ± 0.3119 | 26.56 |
| OneDiff DeepCache (CE) | 26.61 ± 0.4333 | 24.34 ± 0.4189 | 26.61 ± 0.2270 | 28.84 ± 0.3113 | 26.60 |
| OneDiff Quant (EE) | 27.87 ± 0.4419 | 25.70 ± 0.4253 | 27.86 ± 0.2222 | 29.93 ± 0.3920 | 27.84 |
| OneDiff Compile (CE) | 27.84 ± 0.4312 | 25.70 ± 0.4550 | 27.87 ± 0.2638 | 29.91 ± 0.3791 | 27.83 |
| Pytorch | 27.82 ± 0.4275 | 25.70 ± 0.4534 | 27.85 ± 0.2432 | 29.92 ± 0.3666 | 27.82 |

> [!NOTE]
> Scores for four styles ("Paintings", "Photo", "Concept-Art", and "Anime") and the average score are provided. Higher scores indicate better image quality.
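Individual images can be scored against their prompts in the same spirit as the table above; a short sketch, assuming the `hpsv2` package's `score` helper (the reward-model checkpoint is fetched on first call):

```python
import hpsv2

# Path and prompt are illustrative; `score` returns one value per image.
result = hpsv2.score(
    "outputs/paintings/00000.jpg",
    "an oil painting of a lighthouse in a storm",
    hps_version="v2.1",
)
print(result)  # higher means the image better matches human preference
```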

SSIM, MSE, and MAE measure similarity to the PyTorch outputs, which serve as the reference (hence the empty Pytorch row):

| Optimization Technique | SSIM | MSE | MAE |
|--------|--------|--------|--------|
| OneDiff Quant + OneDiff DeepCache (EE) | 0.7483 | 76.123 | 93.163 |
| OneDiff DeepCache (CE) | 0.7504 | 76.198 | 92.085 |
| OneDiff Quant (EE) | 0.8794 | 30.664 | 117.736 |
| OneDiff Compile (CE) | 0.9380 | 16.155 | 93.989 |
| Pytorch | - | - | - |
<details>
<summary>CLIP Score comparison results:</summary>

| Optimization Technique | Paintings | Photo | Concept-Art | Anime | Average Score |
|--------------------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 35.46 | 34.44 | 35.24 | 31.85 | 34.25 |
| OneDiff DeepCache (CE) | 35.42 | 34.47 | 35.15 | 31.83 | 34.22 |
| OneDiff Quant (EE) | 35.88 | 34.74 | 35.53 | 31.80 | 34.49 |
| OneDiff Compile (CE) | 35.78 | 34.83 | 35.43 | 31.77 | 34.45 |
| Pytorch | 35.78 | 34.83 | 35.42 | 31.77 | 34.45 |

</details>
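The CLIP scores above measure image-text alignment. A sketch of scoring a single pair with torchmetrics' `CLIPScore`; the repo's own scripts may instead use the clip-score package cited in the References:

```python
import numpy as np
import torch
from PIL import Image
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Load an image as a uint8 (C, H, W) tensor; the path is illustrative.
img = Image.open("outputs/photo/00000.jpg").convert("RGB")
tensor = torch.from_numpy(np.asarray(img)).permute(2, 0, 1)

score = metric(tensor, "a photo of a red barn in the snow")
print(float(score))  # higher indicates better image-text alignment
```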
<details>
<summary>Average Aesthetic Score and Inception Score comparison results:</summary>

| Optimization Technique | Average Aesthetic Score | Average Inception Score |
|---------------------------------------|-------------------------|------------------------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 5.93 | 16.43 ± 3.75 |
| OneDiff DeepCache (CE) | 5.91 | 15.82 ± 3.80 |
| OneDiff Quant (EE) | 5.97 | 16.02 ± 4.60 |
| OneDiff Compile (CE) | 5.97 | 15.88 ± 4.43 |
| Pytorch | 5.97 | 15.80 ± 4.24 |

</details>
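Of these two metrics, the Inception Score can be reproduced with off-the-shelf tooling; a sketch using torchmetrics (the aesthetic score additionally requires the CLIP+MLP predictor checkpoint cited in the References, omitted here):

```python
import os

import numpy as np
import torch
from PIL import Image
from torchmetrics.image.inception import InceptionScore

metric = InceptionScore()  # defaults to 10 splits, hence the ± spread reported above

for name in sorted(os.listdir("outputs/anime")):  # folder name is illustrative
    img = Image.open(os.path.join("outputs/anime", name)).convert("RGB")
    tensor = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).unsqueeze(0)  # uint8 (1, 3, H, W)
    metric.update(tensor)

mean, std = metric.compute()
print(f"{mean:.2f} ± {std:.2f}")
```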
### SD 1.5
Run:
```
bash run_sd1_5_tests.sh
```
**HPS v2** comparison results:

| Optimization Technique | Paintings Score | Photo Score | Concept-Art Score | Anime Score | Average Score |
|-------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 24.11 ± 0.2549 | 25.10 ± 0.5905 | 24.07 ± 0.3959 | 25.45 ± 0.2102 | 24.68 |
| OneDiff DeepCache (CE) | 23.88 ± 0.4237 | 25.23 ± 0.4587 | 23.96 ± 0.4445 | 25.51 ± 0.2846 | 24.65 |
| OneDiff Quant (EE) | 24.68 ± 0.2271 | 25.54 ± 0.5553 | 24.73 ± 0.3563 | 26.02 ± 0.4202 | 25.24 |
| OneDiff Compile (CE) | 24.58 ± 0.3372 | 25.83 ± 0.3850 | 24.71 ± 0.4705 | 26.25 ± 0.2840 | 25.34 |
| Pytorch | 24.55 ± 0.3336 | 25.78 ± 0.3986 | 24.70 ± 0.4624 | 26.24 ± 0.2989 | 25.32 |
<details>
<summary>CLIP Score comparison results:</summary>

| Optimization Technique | Paintings Score | Photo Score | Concept-Art Score | Anime Score | Average Score |
|--------------------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 33.55 | 32.72 | 33.57 | 30.87 | 32.68 |
| OneDiff DeepCache (CE) | 33.62 | 32.79 | 33.48 | 30.96 | 32.71 |
| OneDiff Quant (EE) | 33.64 | 32.84 | 33.72 | 30.79 | 32.75 |
| OneDiff Compile (CE) | 33.75 | 33.00 | 33.63 | 30.91 | 32.82 |
| Pytorch | 33.76 | 32.98 | 33.62 | 30.96 | 32.83 |

</details>
<details>
<summary>Average Aesthetic Score and Inception Score comparison results:</summary>

| Optimization Technique | Average Aesthetic Score | Average Inception Score |
|---------------------------------------|-------------------------|------------------------------|
| OneDiff Quant + OneDiff DeepCache (EE) | 5.43 | 14.71 ± 3.70 |
| OneDiff DeepCache (CE) | 5.42 | 15.30 ± 4.59 |
| OneDiff Quant (EE) | 5.46 | 15.05 ± 4.31 |
| OneDiff Compile (CE) | 5.46 | 15.20 ± 4.07 |
| Pytorch | 5.46 | 15.25 ± 4.49 |

</details>
### SVD
Run:
```
bash run_svd_tests.sh
```
> [!NOTE]
> Evaluation uses the last frame of each generated video.

**HPS v2** comparison results:

| Optimization Technique | Paintings Score | Photo Score | Concept-Art Score | Anime Score | Average Score |
|-----------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff DeepCache (CE) | 24.72 ± 0.1604 | 22.77 ± 0.0308 | 25.15 ± 0.2523 | 25.00 ± 1.0273 | 24.41 |
| OneDiff Quant + OneDiff DeepCache (EE) | 24.72 ± 0.0327 | 22.81 ± 0.0881 | 25.25 ± 0.0405 | 25.19 ± 0.8912 | 24.49 |
| OneDiff Compile (CE) | 25.84 ± 0.0566 | 24.54 ± 0.1882 | 26.43 ± 0.0194 | 26.79 ± 0.5265 | 25.90 |
| Pytorch | 25.82 ± 0.1076 | 24.28 ± 0.1298 | 26.48 ± 0.0792 | 26.82 ± 0.5806 | 25.85 |

<details>
<summary>CLIP Score comparison results:</summary>

| Optimization Technique | Paintings Score | Photo Score | Concept-Art Score | Anime Score | Average Score |
|----------------------------------|-----------------|-------------|-------------------|-------------|---------------|
| OneDiff DeepCache (CE) | 31.75 | 30.52 | 30.68 | 29.42 | 30.59 |
| OneDiff Quant + OneDiff DeepCache (EE) | 31.82 | 30.54 | 30.83 | 29.38 | 30.64 |
| OneDiff Compile (CE) | 32.57 | 31.38 | 31.66 | 30.02 | 31.41 |
| Pytorch | 32.43 | 31.24 | 31.81 | 29.92 | 31.35 |

</details>

<details>
<summary>Average Aesthetic Score and Inception Score comparison results:</summary>

| Optimization Technique | Average Aesthetic Score | Average Inception Score |
|---------------------------------------|-------------------------|---------------------------|
| OneDiff DeepCache (CE) | 5.32 | 7.63 ± 2.19 |
| OneDiff Quant + OneDiff DeepCache (EE) | 5.31 | 7.86 ± 2.25 |
| OneDiff Compile (CE) | 5.48 | 8.18 ± 2.33 |
| Pytorch | 5.50 | 7.88 ± 1.97 |

</details>

## Quick Start
Evaluating any generative model takes two steps; the Kolors model is used as the example below.
### 1. Generate benchmark images.
Assume the folders `kolors_torch_coco`, `kolors_oneflow_coco`, and `kolors_nexfort_coco` store, respectively, the original images, the images generated with onediff's oneflow backend, and the images generated with the nexfort backend.
- On MS COCO-30K:
  ```
  # Create a path to store the generated images.
  mkdir /path/to/your/kolors_torch_coco
  ```
  ```
  # Generate the reference images with the original PyTorch pipeline.
  python3 models/kolors/text_to_image_kolors_quality_benchmark.py --dataset coco --csv-file resources/MS-COCO_val2014_30k_captions.csv --output-dir /path/to/your/kolors_torch_coco
  ```
  ```
  # Accelerate using onediff's oneflow backend.
  python3 models/kolors/text_to_image_kolors_quality_benchmark.py --compiler oneflow --dataset coco --csv-file resources/MS-COCO_val2014_30k_captions.csv --output-dir /path/to/your/kolors_oneflow_coco
  ```
  ```
  # Accelerate using onediff's nexfort backend.
  python3 models/kolors/text_to_image_kolors_quality_benchmark.py --compiler nexfort --compiler-config '{"mode": "max-optimize:max-autotune:low-precision", "memory_format": "channels_last"}' --dataset coco --csv-file resources/MS-COCO_val2014_30k_captions.csv --output-dir /path/to/your/kolors_nexfort_coco
  ```
- On Human Preference Dataset v2 (HPD v2):
  Set `--dataset hps`, drop the `--csv-file` parameter (prompts are not read from a CSV), and choose a separate `--output-dir` for the generated images. For example:
  ```
  python3 models/kolors/text_to_image_kolors_quality_benchmark.py --dataset hps --output-dir /path/to/your/kolors_torch_hps
  ```
### 2. Test using multiple indicators with scripts.
```
bash scripts/run_kolors_tests.sh coco
bash scripts/run_kolors_tests.sh hps
```
A sample quality report: [models/kolors/README.md](models/kolors/README.md).
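For orientation, each generation script follows roughly this pattern: build the pipeline, optionally compile it with OneDiff, then render one image per prompt. A sketch, assuming diffusers' `KolorsPipeline` and onediffx's `compile_pipe`; see `models/kolors/text_to_image_kolors_quality_benchmark.py` for the actual arguments and defaults:

```python
import torch
from diffusers import KolorsPipeline
from onediffx import compile_pipe  # OneDiff's pipeline-level compile helper

pipe = KolorsPipeline.from_pretrained(
    "Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Compile with the nexfort backend; pass backend="oneflow" for the oneflow path,
# or skip this line entirely to produce the PyTorch reference images.
pipe = compile_pipe(pipe, backend="nexfort")

image = pipe("a watercolor painting of a harbor at dawn").images[0]
image.save("kolors_nexfort_coco/00000.png")  # directory layout is illustrative
```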
## References
- Maximilian Seitzer. Compute FID scores with PyTorch. https://github.com/mseitzer/pytorch-fid, 2020.
- Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., & Li, H. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. arXiv preprint arXiv:2306.09341, 2023.
- Zhengwentai Sun. clip-score: CLIP Score for PyTorch. https://github.com/Taited/clip-score, 2023.
- Christoph Schuhmann. CLIP+MLP Aesthetic Score Predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. NeurIPS, 29, 2016.
## Citing
```
@misc{odeval,
  author       = {Xiang Li and others},
  title        = {odeval: A Library for benchmarking the accelerated generation quality},
  year         = {2023},
  publisher    = {SiliconFlow},
  howpublished = {\url{https://github.com/siliconflow/odeval}},
  note         = {Accessed: 2024-07-26}
}
```