
Request for [D,R] RIFE trained on Style loss, also called Gram matrix loss (the best perceptual loss function) #12

Closed
AIVFI opened this issue Jan 28, 2024 · 3 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

AIVFI commented Jan 28, 2024

You were right to write here that your new work may be of great interest to me. The level of detail retention is really impressive compared to the baseline models, and your work could revolutionise the way VFI models are developed, especially those for practical applications.

I am particularly grateful to you for training the [D,R] RIFE-vgg model and for the x128 interpolation comparison with the [D,R] RIFE and [T] RIFE models. You recommend [D,R] RIFE for more stable results, and after seeing the GIF files with the interpolation results, I fully share your opinion. These results mean that I will have to completely revise my introduction to the Video Frame Interpolation Rankings and Video Deblurring Rankings.

I think, however, that there is a solution to get all the advantages of the [D,R] RIFE-vgg and [D,R] RIFE models while eliminating their disadvantages. That solution is Style loss, also called Gram matrix loss.

Style loss was first used to train a video frame interpolation model by Google Research, for their FILM-𝓛S model (link: ECCV). It was used a second time by Disney Research, for their UGFI 𝓛S model (link: CVPR).
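For context, here is a minimal PyTorch sketch of what a Gram matrix (style) loss looks like. It assumes VGG feature maps have already been extracted for the interpolated and ground-truth frames, and the normalization is one common convention rather than the exact formulation from either paper:

```python
# Minimal sketch of Style (Gram matrix) loss. Assumes lists of VGG feature
# maps for the predicted and ground-truth frames; the (c * h * w)
# normalization is one common convention, not necessarily the papers' exact one.
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-channel correlations of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_pred, feats_gt):
    """Sum of L2 distances between Gram matrices over the chosen VGG layers."""
    return sum(
        F.mse_loss(gram_matrix(fp), gram_matrix(fg))
        for fp, fg in zip(feats_pred, feats_gt)
    )
```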

Both models achieved some of the best, and possibly the best, LPIPS results (we lack a direct comparison with the other two models in the top four, and the results are very close to each other):

Vimeo-90K triplet: LPIPS ≤ 0.017 [excluding LPIPS(SqueezeNet) results]

| Rank | Model | LPIPS ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth |
|:---:|:---|:---:|:---|:---:|:---:|:---:|
| 1 | EAFI-𝓛ecp (arXiv) | 0.012 {2} (arXiv) | Vimeo-90K triplet | - | EAFI-𝓛ecp | - |
| 2 | UGFI 𝓛S (CVPR) | 0.0126 {2} (CVPR) | Vimeo-90K triplet | - | UGFI 𝓛S | - |
| 3 | SoftSplat - 𝓛F (CVPR) | 0.013 {2} (CVPR) | Vimeo-90K triplet | GitHub | SoftSplat - 𝓛F | - |
| 4 | FILM-𝓛S (ECCV) | 0.0132 {2} (CVPR) | Vimeo-90K triplet | GitHub | FILM-𝓛S | - |
| 5 | EDSC_s-𝓛F (TPAMI) | 0.016 {2} (arXiv) | Vimeo-90K triplet | GitHub | EDSC_s-𝓛F | - |
| 6 | CtxSyn - 𝓛F (CVPR) | 0.017 {2} (CVPR) | proprietary | - | CtxSyn - 𝓛F | - |
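For reference, a minimal sketch of how LPIPS scores like those above are typically computed, using the official `lpips` package (the table excludes the SqueezeNet variant; 'alex' and 'vgg' are the usual backbone choices). The image tensors here are random placeholders:

```python
# Minimal sketch of computing an LPIPS score (pip install lpips).
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')  # or net='vgg'

# Images as [-1, 1]-normalized tensors of shape (N, 3, H, W);
# random placeholders here, standing in for an interpolated frame and
# its ground truth.
img_pred = torch.rand(1, 3, 256, 448) * 2 - 1
img_gt = torch.rand(1, 3, 256, 448) * 2 - 1

with torch.no_grad():
    d = loss_fn(img_pred, img_gt)
print(d.item())  # lower is better
```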

However, the most interesting thing is the visual comparison of the three loss functions. In my opinion, Style loss clearly gives the best result perceptually:

[Image: FILM - Loss Functions Ablation]
Source: FILM - Loss Functions Ablation, https://film-net.github.io/

Furthermore, interpolation with FILM-𝓛S can eliminate the artefacts seen with FILM-𝓛1, as shown in Fig. 1 of the Supplementary Material.

More details on Style loss and more examples are on YouTube: https://www.youtube.com/watch?v=OAD-BieIjH4&t=160s

The UGFI model trained with Style loss also retains an amazing amount of fine detail, as the examples at the bottom of Figure 6 in the Supplementary Material show particularly well.

- The Style loss equation can be found in Sec. 3.1.
- The loss combination weights for the FILM-𝓛S model are in Sec. 1.1 of the Supplementary Material.
- The loss combination weights for the UGFI 𝓛S model are in Sec. 3.3 (a rough sketch of such a weighted combination follows below).
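A hedged sketch of how such a weighted combination might look. The weight values below are placeholders for illustration only, not the published ones; those are in the sections cited above:

```python
# Hypothetical weighted combination in the spirit of FILM-𝓛S / UGFI 𝓛S:
# total = w_l1 * L1 + w_vgg * VGG-feature L2 + w_style * Gram-matrix L2.
import torch
import torch.nn.functional as F

W_L1, W_VGG, W_STYLE = 1.0, 0.25, 40.0  # placeholders, not the published values

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def combined_loss(pred, gt, feats_pred, feats_gt):
    """pred/gt: images; feats_pred/feats_gt: lists of VGG feature maps."""
    l1 = F.l1_loss(pred, gt)
    vgg = sum(F.mse_loss(fp, fg) for fp, fg in zip(feats_pred, feats_gt))
    style = sum(F.mse_loss(gram(fp), gram(fg))
                for fp, fg in zip(feats_pred, feats_gt))
    return W_L1 * l1 + W_VGG * vgg + W_STYLE * style
```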

You did a great job with the [T] RIFE, [D,R] RIFE, [D,R] RIFE-vgg comparison. You've shown something I haven't seen anywhere before: at x128 interpolation, the messy artefacts that VGG loss can introduce may outweigh its benefit of preserving fine detail. I think more people may now seriously consider whether to use a perceptual loss function for practical purposes.

Therefore, I have a big request: please train the [D,R] RIFE model using Style loss and compare it with the other three models whose x128 interpolation results you showed as GIF files. This may be the best model for practical use, retaining the most detail through your method combined with Style loss training, without creating messy artefacts.

I would like to include the results of whatever comparison you manage to achieve in the introduction to the Video Frame Interpolation Rankings and Video Deblurring Rankings, to draw other researchers' attention to how to train VFI models for practical applications, and of course to your method as well.

Also, have a look at what your neighbours in your city developed a month ago: a no-reference Perceptual Quality Assessment method for Video Frame Interpolation, in particular TABLE I.

zzh-tech (Owner) commented

Hi, thanks for the advice, and sorry for the late reply.
I can try it when I'm less busy.
But if you or anyone else can help run this experiment, that would be a big help.

AIVFI commented Mar 20, 2024

Many thanks for your response and willingness to solve this intriguing problem.

I completely understand the lack of time, as I know this problem from my own experience. I am ashamed of how my rankings look and of the fact that I don't have time to update them. Unfortunately, all my activity here is a hobby; my day job is something completely different.

I would love to help, but unfortunately I'm not even a programmer, let alone capable of something as complicated as training AI models. I'm planning to buy an NVIDIA GeForce RTX 5090 graphics card early next year, and if someone develops software with a GUI for training AI models, then of course I'd be happy to help with training or other testing.

I will, however, try to help as much as I can to publicise this thread, and maybe someone else will get interested in the topic and help with this experiment.

I think one thing would help a lot to increase interest in this experiment and your InterpAny-Clearer method in general.

If you could add one more model, without any additional training, to the comparison of:

- [T] RIFE
- [D,R] RIFE (Ours)
- [D,R] RIFE-vgg (Ours)

namely:

- [T] RIFE v4.15 (link: https://github.com/hzwer/Practical-RIFE#model-list)

I think generating an animated GIF file with x128 interpolation won't take much time, and it would greatly increase interest in your method and allow me to better publicise InterpAny-Clearer among the many enthusiasts who use this very practical model on a daily basis, rather than the base version of RIFE. A rough sketch of one way to generate such a GIF follows below.
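In case it helps, here is a rough Python sketch of producing such a GIF by recursive 2x midpoint interpolation (7 doublings give 2^7 = 128 intervals). `interpolate_mid` is purely a placeholder for whatever model call returns the frame halfway between two inputs, and the file name is hypothetical:

```python
# Rough sketch: x128 interpolation between two frames via recursive 2x
# midpoint interpolation, then writing the result as an animated GIF.
import imageio.v2 as imageio

def interpolate_x128(frame0, frame1, interpolate_mid):
    frames = [frame0, frame1]
    for _ in range(7):  # each pass doubles the number of intervals
        doubled = []
        for a, b in zip(frames[:-1], frames[1:]):
            doubled.extend([a, interpolate_mid(a, b)])
        doubled.append(frames[-1])
        frames = doubled
    return frames  # 129 frames spanning 128 intervals

# Example usage (frames as uint8 HxWx3 arrays, model_mid as the model call):
# imageio.mimsave("x128.gif", interpolate_x128(f0, f1, model_mid), duration=0.04)
```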

In my opinion, it is the comparison with the practical, currently best version of RIFE, i.e. RIFE v4.15, that will best confirm the need for InterpAny-Clearer and give me the proof I need to publicise, in my rankings and beyond, the case for using this improvement in all future practical models, not only RIFE.

Of course, I would also like to know which loss function to recommend in the introduction to my rankings so that future models use it too. This is why it is so important to find out which loss function works best with InterpAny-Clearer. You recommend using [D,R] RIFE rather than [D,R] RIFE-vgg, and I agree that the example with the young woman's face justifies this. However, I think it is worth running this experiment to see whether Style loss, also called Gram matrix loss, gives an even better result. Then we will have clear proof of which loss function InterpAny-Clearer works best with.

AIVFI commented May 30, 2024

See RIFE v4.17, trained on Style loss (also called Gram matrix loss): https://github.com/hzwer/Practical-RIFE#model-list

@AIVFI closed this as completed on May 30, 2024
@zzh-tech added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Oct 30, 2024