
Request for [D,R] RIFE trained on Style loss, also called Gram matrix loss (the best perceptual loss function) #12

Closed
AIVFI opened this issue Jan 28, 2024 · 3 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

AIVFI commented Jan 28, 2024

You were right to write here that your new work may be of great interest to me. The level of detail retention is really impressive compared to the baseline models, and your work could revolutionise the way VFI models are developed, especially those for practical applications.

I am particularly grateful to you for training the [D,R] RIFE-vgg model and for the x128 interpolation comparison with the [D,R] RIFE and [T] RIFE models. You recommend [D,R] RIFE for more stable results, and after seeing the GIF files with the interpolation results, I fully share your opinion. These results mean that I will have to completely revise my introduction to the Video Frame Interpolation Rankings and Video Deblurring Rankings.

I think, however, that there is a solution to get all the advantages of the [D,R] RIFE-vgg and [D,R] RIFE models while eliminating their disadvantages. That solution is Style loss, also called Gram matrix loss.

Style loss was first used to train a video frame interpolation model by Google Research, for their FILM-𝓛S model (link: ECCV). It was used a second time by Disney Research, for their UGFI 𝓛S model (link: CVPR).
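For context, here is a minimal PyTorch sketch of what a Gram matrix (style) loss looks like. It assumes VGG feature maps have already been extracted for the interpolated and ground-truth frames, and the normalization is one common convention rather than the exact formulation from either paper:

```python
# Minimal sketch of Style (Gram matrix) loss. Assumes lists of VGG feature
# maps for the predicted and ground-truth frames; the (c * h * w)
# normalization is one common convention, not necessarily the papers' exact one.
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-channel correlations of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_pred, feats_gt):
    """Sum of L2 distances between Gram matrices over the chosen VGG layers."""
    return sum(
        F.mse_loss(gram_matrix(fp), gram_matrix(fg))
        for fp, fg in zip(feats_pred, feats_gt)
    )
```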

Both models achieved some of the best, and possibly the best, LPIPS results (we lack a direct comparison with the other two models in the top four, and the results are very close to each other):

Vimeo-90K triplet: LPIPS ≤ 0.017 [excluding LPIPS(SqueezeNet) results]

| Rank | Model | LPIPS ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|:---:|:---|:---:|:---|:---:|:---:|:---:|
| 1 | EAFI-𝓛ecp (arXiv) | 0.012 {2} (arXiv) | Vimeo-90K triplet | - | EAFI-𝓛ecp | - |
| 2 | UGFI 𝓛S (CVPR) | 0.0126 {2} (CVPR) | Vimeo-90K triplet | - | UGFI 𝓛S | - |
| 3 | SoftSplat - 𝓛F (CVPR) | 0.013 {2} (CVPR) | Vimeo-90K triplet | GitHub | SoftSplat - 𝓛F | - |
| 4 | FILM-𝓛S (ECCV) | 0.0132 {2} (CVPR) | Vimeo-90K triplet | GitHub | FILM-𝓛S | - |
| 5 | EDSC_s-𝓛F (TPAMI) | 0.016 {2} (arXiv) | Vimeo-90K triplet | GitHub | EDSC_s-𝓛F | - |
| 6 | CtxSyn - 𝓛F (CVPR) | 0.017 {2} (CVPR) | proprietary | - | CtxSyn - 𝓛F | - |
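For reference, a minimal sketch of how LPIPS scores like those above are typically computed, using the official `lpips` package (the table excludes the SqueezeNet variant; 'alex' and 'vgg' are the usual backbone choices). The image tensors here are random placeholders:

```python
# Minimal sketch of computing an LPIPS score (pip install lpips).
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')  # or net='vgg'

# Images as [-1, 1]-normalized tensors of shape (N, 3, H, W);
# random placeholders here, standing in for an interpolated frame and
# its ground truth.
img_pred = torch.rand(1, 3, 256, 448) * 2 - 1
img_gt = torch.rand(1, 3, 256, 448) * 2 - 1

with torch.no_grad():
    d = loss_fn(img_pred, img_gt)
print(d.item())  # lower is better
```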

However, the most interesting thing is the visual comparison of the three loss functions. In my opinion, Style loss clearly gives the best result perceptually:

[Image: FILM - Loss Functions Ablation]
Source: FILM - Loss Functions Ablation, https://film-net.github.io/

Furthermore, interpolation with FILM-𝓛S can eliminate the artefacts seen with FILM-𝓛1, as shown in Fig. 1 of the Supplementary Material.

More details on Style loss and more examples are on YouTube: https://www.youtube.com/watch?v=OAD-BieIjH4&t=160s

The UGFI model trained with Style loss also retains an amazing amount of fine detail, as the examples at the bottom of Figure 6 in the Supplementary Material show particularly well.

- The Style loss equation can be found in Sec. 3.1.
- The loss combination weights for the FILM-𝓛S model are in Sec. 1.1 of the Supplementary Material.
- The loss combination weights for the UGFI 𝓛S model are in Sec. 3.3 (a rough sketch of such a weighted combination follows below).
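A hedged sketch of how such a weighted combination might look. The weight values below are placeholders for illustration only, not the published ones; those are in the sections cited above:

```python
# Hypothetical weighted combination in the spirit of FILM-𝓛S / UGFI 𝓛S:
# total = w_l1 * L1 + w_vgg * VGG-feature L2 + w_style * Gram-matrix L2.
import torch
import torch.nn.functional as F

W_L1, W_VGG, W_STYLE = 1.0, 0.25, 40.0  # placeholders, not the published values

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def combined_loss(pred, gt, feats_pred, feats_gt):
    """pred/gt: images; feats_pred/feats_gt: lists of VGG feature maps."""
    l1 = F.l1_loss(pred, gt)
    vgg = sum(F.mse_loss(fp, fg) for fp, fg in zip(feats_pred, feats_gt))
    style = sum(F.mse_loss(gram(fp), gram(fg))
                for fp, fg in zip(feats_pred, feats_gt))
    return W_L1 * l1 + W_VGG * vgg + W_STYLE * style
```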

You did a great job with the [T] RIFE, [D,R] RIFE, [D,R] RIFE-vgg comparison. You've shown something I haven't seen anywhere before: at x128 interpolation, the messy artefacts that VGG loss can introduce may outweigh its benefit of preserving fine detail. I think more people may now seriously consider whether to use a perceptual loss function for practical purposes.

Therefore, I have a big request: please train the [D,R] RIFE model using Style loss and compare it with the other three models whose x128 interpolation results you showed as GIF files. This may be the best model for practical use, retaining the most detail through your method combined with Style loss training, without creating messy artefacts.

I would like to include the results of whatever comparison you manage to achieve in the introduction to the Video Frame Interpolation Rankings and Video Deblurring Rankings, to draw other researchers' attention to how to train VFI models for practical applications, and of course to your method as well.

Also, have a look at what your neighbours in your city developed a month ago: a no-reference Perceptual Quality Assessment method for Video Frame Interpolation, in particular TABLE I.

zzh-tech (Owner) commented

Hi, thanks for the advice, and sorry for the late reply.
I can try it when I'm less busy.
But if you or anyone else can help run this experiment, that would be a big help.

AIVFI commented Mar 20, 2024

Many thanks for your response and willingness to solve this intriguing problem.

I completely understand the lack of time, as I know this problem from my own experience. I am ashamed of how my rankings look and of the fact that I don't have time to update them. Unfortunately, all my activity here is a hobby; my day job is something completely different.

I would love to help, but unfortunately I'm not even a programmer, let alone capable of something as complicated as training AI models. I'm planning to buy an NVIDIA GeForce RTX 5090 graphics card early next year, and if someone develops software with a GUI for training AI models, then of course I'd be happy to help with training or other testing.

I will, however, try to help as much as I can to publicise this thread, and maybe someone else will get interested in the topic and help with this experiment.

I think one thing would help a lot to increase interest in this experiment and your InterpAny-Clearer method in general.

If you could add one more model, without any additional training, to the comparison of:

- [T] RIFE
- [D,R] RIFE (Ours)
- [D,R] RIFE-vgg (Ours)

namely:

- [T] RIFE v4.15 (link: https://github.com/hzwer/Practical-RIFE#model-list)

I think generating an animated GIF file with x128 interpolation won't take much time, and it would greatly increase interest in your method and allow me to better publicise InterpAny-Clearer among the many enthusiasts who use this very practical model on a daily basis, rather than the base version of RIFE. A rough sketch of one way to generate such a GIF follows below.
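In case it helps, here is a rough Python sketch of producing such a GIF by recursive 2x midpoint interpolation (7 doublings give 2^7 = 128 intervals). `interpolate_mid` is purely a placeholder for whatever model call returns the frame halfway between two inputs, and the file name is hypothetical:

```python
# Rough sketch: x128 interpolation between two frames via recursive 2x
# midpoint interpolation, then writing the result as an animated GIF.
import imageio.v2 as imageio

def interpolate_x128(frame0, frame1, interpolate_mid):
    frames = [frame0, frame1]
    for _ in range(7):  # each pass doubles the number of intervals
        doubled = []
        for a, b in zip(frames[:-1], frames[1:]):
            doubled.extend([a, interpolate_mid(a, b)])
        doubled.append(frames[-1])
        frames = doubled
    return frames  # 129 frames spanning 128 intervals

# Example usage (frames as uint8 HxWx3 arrays, model_mid as the model call):
# imageio.mimsave("x128.gif", interpolate_x128(f0, f1, model_mid), duration=0.04)
```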

In my opinion, it is the comparison with the practical, currently best version of RIFE, i.e. RIFE v4.15, that will best confirm the need for InterpAny-Clearer and give me the proof I need to publicise, in my rankings and beyond, the case for using this improvement in all future practical models, not only RIFE.

Of course, I would also like to know which loss function to recommend in the introduction to my rankings so that future models use it too. This is why it is so important to find out which loss function works best with InterpAny-Clearer. You recommend using [D,R] RIFE rather than [D,R] RIFE-vgg, and I agree that the example with the young woman's face justifies this. However, I think it is worth running this experiment to see whether Style loss, also called Gram matrix loss, gives an even better result. Then we will have clear proof of which loss function InterpAny-Clearer works best with.

AIVFI commented May 30, 2024

See RIFE v4.17, trained on Style loss (also called Gram matrix loss): https://github.com/hzwer/Practical-RIFE#model-list

@AIVFI closed this as completed on May 30, 2024
@zzh-tech added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Oct 30, 2024