Lots of variance in 'best' #182
Hi there, thank you so much for your detailed feedback and thorough analysis! I'm glad to hear that IQA-PyTorch is proving useful in your model evaluation process. I’d be happy to help clarify your situation.
Image quality assessment can be quite subjective, and finding a universal metric that works for all scenarios is challenging. Here’s a more in-depth explanation if you’re interested in the technical details:
Understanding Metric Variability:
Additional Points:
Choosing the Right Metric: For a more informed selection, we recommend referring to the performance evaluation protocol here: https://github.com/chaofengc/IQA-PyTorch/tree/main?tab=readme-ov-file#performance-evaluation-protocol. The best metric choice depends on the benchmark and your specific task.
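As a minimal sketch of that selection step (assuming the pyiqa API described in the project README; the metric name below is just an example), one can list the registered metrics and check whether a lower or higher score counts as better before committing to one:

```python
# Minimal sketch: list available metrics and check a metric's score direction.
# Assumes the documented pyiqa API; 'lpips' is only an example metric name.
import pyiqa

print(pyiqa.list_models())          # names of all registered FR and NR metrics

metric = pyiqa.create_metric('lpips')
print(metric.lower_better)          # True -> lower scores mean better quality
```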
Thank you a lot for your detailed response and the explanations :) Also, I was able to run piqe, qalign_4bit, and qalign_8bit locally on my images in this test, thanks for adding them :) (Ah thanks, I had simply sorted the metrics according to https://github.com/chaofengc/IQA-PyTorch/blob/main/docs/ModelCard.md, where CKDN was listed under FR Methods; this is why I had it there and was unsure about the scoring.)
You're welcome! I'm glad to hear that you found the explanations helpful. Regarding CKDN: while it’s an interesting method, its current performance still has room for improvement and does not stand out in our current benchmark.
Hey
First, thank you for all your work :)
tl;dr: This is less of an issue and more of a question: is it normal that, when using many different metrics, they all seem to have different opinions on which images are best, and only a few point to the same images being best (like dists & lpips-vgg on 90k, topiq_fr & lpips on 30k instead, topiq_fr-pipal & stlpips on 150k instead, etc.)?
I train sisr models as one of my hobbies, and I thought I could maybe use metrics to find the best release checkpoint of a model training (a dat2 model) I was doing.
So I scored the 7 val images I was using, for each of these checkpoints (10k, 20k, 30k, ..., 210k iterations), with a few metrics to find the best checkpoint.
I got different results: psnr said 70k is the best checkpoint, ssim said 10k, dists said 90k, lpips & topiq_fr said 30k, and topiq_fr-pipal said 150k.
So I did a more extensive test and ran 68 metrics (could not run qalign because it's resource hungry, and piqe had some input shape errors) on these 7 val images, and also scored HR against HR as a baseline/comparison (because if HR-vs-HR is not best on the FR metrics, then something must have gone wrong).
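For context, here is a rough sketch of how such a scoring run can be put together with pyiqa; the folder layout, file names, and metric list are made up for illustration, and only `pyiqa.create_metric` plus calling the metric on image paths follow the library's documented usage:

```python
# Rough sketch: score each checkpoint's 7 val outputs against the HR references
# with several FR metrics. Paths and the metric list are hypothetical.
import pyiqa
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
metric_names = ['psnr', 'ssim', 'lpips', 'dists', 'topiq_fr']
metrics = {name: pyiqa.create_metric(name, device=device) for name in metric_names}

checkpoints = [f'{k}k' for k in range(10, 220, 10)]   # 10k ... 210k iterations
val_images = [f'val_{i}.png' for i in range(1, 8)]    # the 7 val images

results = {}
for ckpt in checkpoints:
    for name, metric in metrics.items():
        # FR metrics take (distorted, reference); both can be file paths.
        scores = [metric(f'outputs/{ckpt}/{img}', f'hr/{img}').item()
                  for img in val_images]
        results[(ckpt, name)] = sum(scores) / len(scores)  # mean over the 7 images
```

When comparing checkpoints from such a table, each metric's `lower_better` flag has to be taken into account before calling one checkpoint "best".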
I gathered the results in this google sheets: https://docs.google.com/spreadsheets/d/1NL-by7WvZyDMHj5XN8UeDALVSSwH70IKvwV65ATWqrA/edit?usp=sharing
I put in all the scores per checkpoint, visually highlighted the best in red, second best in blue, and third best in green per metric, and also listed underneath the checkpoints sorted by score for each metric. Metrics are sorted according to the Model Cards for IQA-PyTorch documentation page.
Screenshot of the Spreadsheet:
While some checkpoints consistently show up (10k, 60k, 150k, ...), I was surprised how divergent the 'best' scoring checkpoint is between all these metrics. (I was simply expecting the different metrics to point a bit more consistently towards the same checkpoint than this, but I am probably wrong looking at this sheet.) My question was simply whether this is normal / whether this experience is expected?
I was simply trying to find which few metrics I can rely on, so that in the future they might help me find the best release candidate of a model training. Or to then compare the outputs of my already released models against each other on datasets with a few select metrics.
Anyway thank you for all your work :)
IQA-PyTorch is fantastic and made it simple for me to score outputs with multiple different metrics :)
(Ah, and if needed or of interest, all the image files used for this test / sheet can be found in this .tar file on Google Drive.)