Lots of variance in 'best' #182

Open
Phhofm opened this issue Aug 29, 2024 · 3 comments

Phhofm commented Aug 29, 2024

Hey

First thank you for all your work :)

tl;dr: This is less of an issue and more of a question: is it normal that, when using many different metrics, they all seem to have different opinions on which images are best, with only a few pointing to the same images (like dists & lpips-vgg picking 90k, topiq_fr & lpips picking 30k instead, topiq_fr-pipal & stlpips picking 150k instead, etc.)?


I train SISR models as one of my hobbies, and I thought I could maybe use metrics to find the best release checkpoint of a model training run (a dat2 model) I was doing.
So I scored the 7 val images I was using, across these checkpoints (10k, 20k, 30k, ..., 210k iterations), with a few metrics to find the best checkpoint.
I got different results: psnr said 70k is the best checkpoint, ssim said 10k, dists said 90k, lpips & topiq_fr said 30k, topiq_fr-pipal said 150k.

So I did a more extensive test and ran 68 metrics (I could not run qalign because it is resource-hungry, and piqe had some input shape errors) on these 7 val images, and also scored HR against HR as a baseline/comparison checkpoint (because if HR-HR is not best in FR, something must have gone wrong).
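
In case it helps to reproduce, the scoring loop was roughly along these lines (a simplified sketch, not the exact script; the folder layout, metric subset, and file names are placeholders):

```python
import pyiqa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder layout: outputs/<ckpt>/img_<i>.png for model outputs, hr/img_<i>.png for ground truth
checkpoints = [f"{k}k" for k in range(10, 220, 10)]            # 10k ... 210k
metric_names = ["psnr", "ssim", "lpips", "dists", "topiq_fr"]  # small subset, for illustration

results = {}  # (metric, checkpoint) -> mean score over the 7 val images
for name in metric_names:
    metric = pyiqa.create_metric(name, device=device)
    for ckpt in checkpoints:
        scores = [
            metric(f"outputs/{ckpt}/img_{i}.png", f"hr/img_{i}.png").item()
            for i in range(1, 8)
        ]
        results[(name, ckpt)] = sum(scores) / len(scores)
    # metric.lower_better indicates whether lower scores mean better quality for this metric
    pick = min if metric.lower_better else max
    best = pick(checkpoints, key=lambda c: results[(name, c)])
    print(f"{name}: best checkpoint by mean score = {best}")
```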

I gathered the results in this Google Sheet: https://docs.google.com/spreadsheets/d/1NL-by7WvZyDMHj5XN8UeDALVSSwH70IKvwV65ATWqrA/edit?usp=sharing
I put in all the scores per checkpoint, highlighted the best in red, second best in blue, and third best in green per metric, and also listed the checkpoints sorted by score per metric underneath. Metrics are sorted according to the Model Cards page of the IQA-PyTorch documentation.

Screenshot of the spreadsheet: [image attachment]

While some checkpoints consistently show up (10k, 60k, 150k, ...), I was surprised how divergent the 'best'-scoring checkpoint is between all these metrics. (I was simply expecting the different metrics to point a bit more consistently towards the same checkpoint, but looking at this sheet I was probably wrong to expect that.) My question was simply: is this normal / is this experience to be expected?

I was simply trying to find a few metrics I could rely on, so that in the future they might help me find the best release candidate of a model training run, or to compare the outputs of my already released models against each other on datasets with a few select metrics.

Anyway thank you for all your work :)
IQA-Pytorch is fantastic and made it simple for me to score outputs with multiple different metrics :)

(Ah, and if needed or of interest, all the image files used for this test / sheet can be found in this .tar file on Google Drive.)

chaofengc (Owner) commented

Hi there, thank you so much for your detailed feedback and thorough analysis! I'm glad to hear that IQA-PyTorch is proving useful in your model evaluation process. I’d be happy to help clarify your situation.

In short: Yes, it is quite normal for different metrics to yield varying results when assessing image quality.

Image quality assessment can be quite subjective, and finding a universal metric that works for all scenarios is challenging. Here’s a more in-depth explanation if you’re interested in the technical details:


Understanding Metric Variability:

  • Metrics Target Different Aspects: When you have a reference image, as in SISR, FR metrics are generally more reliable than NR metrics. FR metrics measure the difference between the generated image and the reference, while NR/IAA metrics consider broader aesthetic qualities without a clear standard. Tasks with reference images are typically less ambiguous.

    With NR metrics, you might not always see the HR image performing best. This is because NR metrics score an image against an implicit image distribution embedded within the model, and this distribution might not perfectly align with your specific images or tasks.

  • Metric Types: FR metrics come in two flavors: learning-based and handcrafted. Handcrafted methods like fsim rely primarily on pixel-level information and may miss higher-level features like textures and shapes. This might explain why their best checkpoints cluster around similar values (10k or 70k).
  • Learning-Based Metrics: When using learning-based metrics like dists & lpips-vgg, topiq_fr & lpips, or topiq_fr-pipal & stlpips, it's expected that metrics trained on the same dataset will rank checkpoints similarly (e.g., the lpips variants are all trained on BAPPS); a small sketch of how to quantify this kind of agreement follows after this list. However, real-world scenarios can be more complex:
    • Pretraining: Some metrics like topiq rely on an ImageNet-pretrained model, while others like pieapp are trained from scratch. Generally, models pretrained on larger datasets perform better and generalize more effectively (e.g., qalign).
    • Architecture: The model architecture itself can influence results. For example, lpips and lpips+ might exhibit different behaviors.
    • Image Scale: Resizing and preprocessing can affect image quality. While IQA-PyTorch minimizes resizing, the influence of image size shouldn't be ignored. Your large images (often 1080p) might contribute to the variance between metrics.
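
To make "similar performance" a bit more concrete: one simple way to quantify how much two metrics agree on the checkpoint ordering is a rank correlation over their per-checkpoint scores. This is just a sketch (not part of the library), with made-up score lists, and lower-is-better metrics negated so that higher always means better:

```python
from scipy.stats import spearmanr

# Hypothetical mean scores per checkpoint for two metrics,
# negated because lower is better for both lpips and dists,
# so that higher = better for the ranking comparison.
scores_lpips = [-0.21, -0.18, -0.15, -0.16, -0.17]
scores_dists = [-0.19, -0.16, -0.13, -0.15, -0.16]

rho, p = spearmanr(scores_lpips, scores_dists)
print(f"Spearman rank correlation between the two rankings: {rho:.3f} (p={p:.3f})")
```

A high correlation suggests the two metrics would pick similar release candidates; a low one means they are judging quite different aspects of the images.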

Additional Points:

  • We appreciate you reporting the shape error in piqe. This has been fixed in the latest commit.
  • ckdn is actually an NR metric, where higher values indicate better quality.
  • For resource-constrained environments, we've added qalign_8bit and qalign_4bit; a minimal usage sketch is included below. qalign offers excellent generalization due to its large-scale image pretraining.
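
A minimal usage sketch for the quantized variants, assuming they are registered under these names and are called like any other NR metric (the image path is a placeholder):

```python
import pyiqa

# Quantized Q-Align variant for resource-constrained setups
qalign = pyiqa.create_metric("qalign_8bit", device="cuda")

# NR metric: score a single image, no reference needed
score = qalign("outputs/150k/img_1.png")
print(float(score))
```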

Choosing the Right Metric:

For a more informed selection, we recommend referring to the performance evaluation protocol here: https://github.com/chaofengc/IQA-PyTorch/tree/main?tab=readme-ov-file#performance-evaluation-protocol. The best metric choice depends on the benchmark and your specific task.


Phhofm commented Sep 1, 2024

Thank you a lot for your detailed response with the explanations :)

Also, I was able to run piqe, qalign_4bit, and qalign_8bit locally on my images in this test, thanks for adding them :)

(Ah, thanks. I had simply sorted the metrics according to https://github.com/chaofengc/IQA-PyTorch/blob/main/docs/ModelCard.md, where CKDN was listed under FR Methods, which is why I had it there and was unsure about the scoring.)


chaofengc commented Sep 2, 2024

You're welcome! I'm glad to hear that you found the explanations helpful.

Regarding ckdn, I apologize for any confusion in my previous explanation. ckdn is a unique type of IQA metric, often referred to as a degraded-reference metric, specifically designed for image restoration tasks. Unlike traditional metrics that require pristine high-quality ground truth, ckdn takes a pair consisting of a low-quality image and its corresponding restored image as input. It therefore doesn't require a high-quality reference image, but it does rely on the restored image as a reference, which is why I listed it under the FR metrics.
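
For reference, a rough usage sketch (the argument order here is an assumption, so please check the docs/model card for the exact calling convention; the file paths are placeholders):

```python
import pyiqa

# ckdn: degraded-reference metric -- the restored image serves as the reference,
# so no pristine ground truth is needed.
ckdn = pyiqa.create_metric("ckdn", device="cuda")

# Assumed order: (low-quality image, restored image used as reference)
score = ckdn("low_quality/img_1.png", "restored/img_1.png")
print(float(score))
```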

While it's an interesting method, its current performance still has room for improvement, and it does not stand out in our current benchmark.
