Comparative Analysis and Training Results of VITS2 with HifiGAN, iSTFT and BigVGAN #2

Open
shigabeev opened this issue Sep 2, 2023 · 8 comments

@shigabeev

Greetings,

First and foremost, I'd like to extend my commendations on developing such an outstanding model; its performance surpasses anything I have personally trained thus far. It's a noteworthy contribution to the field, and I applaud your work.

I've conducted a series of training experiments to validate the efficiency and efficacy of your model. For ease of reference, I've made the training results, model weights, and TensorBoard logs publicly accessible. You can review them via the following Google Drive link:
Training Results and Model Weights

Moreover, I've prepared audio samples that compare the performance of your model with that of VITS2, HifiGAN, and BigVGAN. This will offer a comprehensive perspective on how your model stacks up against other state-of-the-art solutions in the domain.
Comparative Audio Samples

Best wishes

@FENRlR
Owner

FENRlR commented Sep 3, 2023

A huge thank you for sharing the results. The main reason for using iSTFT here was the fast synthesis speed it showed in its original VITS variant. As such, I would say the result is far beyond my expectations. Magnificent.

@shigabeev
Author

@FENRlR Do you happen to know the optimal configs for different sampling rates? I need 16kHz, 24kHz and 48kHz.

@FENRlR
Owner

FENRlR commented Sep 4, 2023

Currently, no. It seems there were some issues with the 16kHz sampling rate in the original iSTFT repo. I haven't seen the other two used, however.
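
Since no validated configs are known for these rates, one possible starting point is to scale the frame parameters so that each frame covers roughly the same duration as before. The sketch below is only a heuristic under stated assumptions: the 22050 Hz baseline (hop 256, window 1024) is the commonly used VITS default and is not taken from this thread, the helper names are hypothetical, and the rule that the decoder's total upsampling (times an iSTFT hop size, for the iSTFT decoder) should equal `hop_length` is an assumption to verify against the actual model code. The resulting numbers are untested, not recommended settings.

```python
from math import prod

# Assumed baseline (common VITS default, not confirmed in this thread).
BASE = {"sampling_rate": 22050, "hop_length": 256, "win_length": 1024}


def scale_frame_params(target_sr, base=BASE, multiple=4):
    """Scale hop/window sizes so frame duration stays roughly constant.

    The hop is rounded to a multiple of `multiple` so that it can still be
    factored into integer upsampling stages; the window keeps the baseline
    window-to-hop ratio.
    """
    scale = target_sr / base["sampling_rate"]
    hop = max(multiple, round(base["hop_length"] * scale / multiple) * multiple)
    ratio = base["win_length"] // base["hop_length"]
    win = hop * ratio
    return {
        "sampling_rate": target_sr,
        "hop_length": hop,
        "win_length": win,
        "filter_length": win,
    }


def decoder_matches_hop(upsample_rates, hop_length, istft_hop_size=1):
    """Check the assumed constraint: total upsampling equals hop_length."""
    return prod(upsample_rates) * istft_hop_size == hop_length


if __name__ == "__main__":
    for sr in (16000, 24000, 48000):
        print(scale_frame_params(sr))
    # Baseline sanity check: 8 * 8 * 4 == 256
    print(decoder_matches_hop([8, 8], 256, istft_hop_size=4))
```

Whatever hop length is chosen, the decoder's upsampling stages would need to be re-factored to match it, so values with convenient integer factors are easier to work with than the raw scaled ones.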

@p0p4k

p0p4k commented Sep 5, 2023

@FENRlR hi, can you add me on Discord and ping me? (id -> p0p4k)
thanks.

@DavidNTompkins

Super neat! Was this on an A100? Looks like it took ~3 days?

w11wo mentioned this issue Sep 8, 2023

@Insensiblee

I downloaded the model from the Google Drive link you provided and got this error during inference. Do you know how to solve it?
RuntimeError: Error(s) in loading state_dict for SynthesizerTrn:
size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([155, 192]) from checkpoint, the shape in current model is torch.Size([205, 192]).

@shigabeev
Author

I downloaded the model from the Google Drive link you provided and got this error during inference. Do you know how to solve it? RuntimeError: Error(s) in loading state_dict for SynthesizerTrn: size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([155, 192]) from checkpoint, the shape in current model is torch.Size([205, 192]).

Hey, it's possible that the repository has changed and some weight sizes no longer match the defaults. The easiest way to run it is to check out the commit from around the date of the post, plug in the weights, and launch it from there.

@FENRlR
Owner

FENRlR commented Oct 20, 2023

@Insensiblee Before reverting to that commit, have you tried changing the symbols?
The symbol list he used for Russian has exactly 155 entries, while the default symbol list has 205. So I'm 90% sure that you forgot to modify it.
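
To confirm this kind of diagnosis before loading, the parameter shapes stored in the checkpoint can be compared with the shapes the current model expects. The sketch below is a minimal illustration using plain dictionaries of shapes; the helper names are hypothetical, and only the key `enc_p.emb.weight` and the sizes 155, 205 and 192 come from the error message above.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for every parameter
    that exists in both mappings but has a different shape."""
    return {
        name: (tuple(ckpt_shape), tuple(model_shapes[name]))
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and tuple(ckpt_shape) != tuple(model_shapes[name])
    }


def symbol_count_hint(checkpoint_shapes, n_symbols, key="enc_p.emb.weight"):
    """Compare the text-embedding row count with the active symbol list length.

    Returns None when they agree, otherwise a short explanation.
    """
    rows = checkpoint_shapes[key][0]
    if rows == n_symbols:
        return None
    return (
        f"checkpoint was trained with {rows} symbols, "
        f"but the current symbol list has {n_symbols}"
    )


if __name__ == "__main__":
    # Shapes taken from the error message in this thread.
    ckpt = {"enc_p.emb.weight": (155, 192)}
    model = {"enc_p.emb.weight": (205, 192)}
    print(find_shape_mismatches(ckpt, model))
    print(symbol_count_hint(ckpt, n_symbols=205))
```

With PyTorch, such a mapping can be built from a state dict as `{k: tuple(v.shape) for k, v in state_dict.items()}`. Where the state dict sits inside the checkpoint file (for example under a `"model"` key) is an assumption to check against the repository's loading code.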
