[WIP] Naturalspeech 2 Implementation #2638
Conversation
Thank you for the PR! I listened to some of the NaturalSpeech 2 examples and they sound great. But are there any models/weights available to test or download this with? Or is it already possible to train a model for our own language with this?
@kungfooman This is still a work in progress. Since I can only work on this on weekends, it'll take some time to complete; right now I am still resolving some errors.
Status Update:
Let me know when you want this tested; I'd be happy to give it a run on a multispeaker use case.
There were still a few bugs in the pitch and duration pipeline, which I have resolved, but I need to test it once more and complete the inference function; until then it's still not trainable. Once it starts generating voice on a toy dataset, I'll post results here.
@erogol The score loss is not yet implemented. Can you look at the forward function and check whether it looks OK? The training script runs, but inference is not complete.
remaining_mask = torch.ones_like(latents, dtype=torch.bool)

# Get a random segment for the speech prompt
speech_prompts, segment_indices = rand_segments(
I'd do it in train_step so as not to tie the way we set the prompt to the forward function.
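The suggestion above might look like the sketch below: the prompt segment is picked in `train_step` and handed to `forward`, so the model itself stays agnostic to how the prompt was chosen. The `rand_segments` function here is a hypothetical minimal stand-in written for illustration; the real helper in the repo has more options.

```python
import torch

def rand_segments(x, x_lengths, segment_size):
    # Minimal stand-in: pick a random valid start per sample and
    # slice out a fixed-size segment along the time axis.
    B, C, T = x.shape
    max_starts = (x_lengths - segment_size).clamp(min=0)
    starts = (torch.rand(B) * (max_starts + 1).float()).long()
    segments = torch.stack(
        [x[i, :, s : s + segment_size] for i, s in enumerate(starts)]
    )
    return segments, starts

# In train_step: extract the prompt here, then pass it into forward.
latents = torch.randn(2, 8, 32)      # (batch, channels, frames), assumed layout
lens = torch.tensor([32, 24])
speech_prompts, segment_indices = rand_segments(latents, lens, segment_size=16)
```

This keeps `forward` reusable at inference time, where the prompt comes from a reference utterance rather than a random slice of the target.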
TTS/tts/models/naturalspeech2.py
Outdated
# iterate over the batch dimension
for i in range(latents.size(0)):
    remaining_mask[i, :, segment_indices[i] : segment_indices[i] + self.diff_segment_size] = 0
Isn't it easier to just iterate over remaining_latents? It looks like remaining_mask isn't used anywhere else.
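One way to read this suggestion: skip the boolean mask entirely and slice the frames outside each sample's prompt segment directly. A hedged sketch, assuming `latents` is `(batch, channels, frames)` and all samples share the same segment size:

```python
import torch

B, C, T, seg = 2, 8, 32, 16
latents = torch.randn(B, C, T)
segment_indices = torch.tensor([4, 10])

# Concatenate the frames before and after each prompt segment,
# producing the "remaining" latents without an intermediate mask.
remaining_latents = torch.stack([
    torch.cat([latents[i, :, :s], latents[i, :, s + seg:]], dim=-1)
    for i, s in enumerate(segment_indices)
])
```

Because the segment size is fixed, every sample's remaining part has length `T - seg`, so `torch.stack` is safe here.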
TTS/tts/models/naturalspeech2.py
Outdated
remaining_latents_lengths = torch.tensor(remaining_latents.shape[1:2]).to(remaining_latents.device)

# Encode the speech prompt
speech_prompts_enc = self.prompt_encoder(speech_prompts)
I'd make a separate function to compute the prompt and return the transposed tensor to get rid of the transposes below.
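A possible shape for that refactor, sketched with a hypothetical wrapper and a stand-in 1x1 convolution as the "prompt encoder" (names and shapes are assumptions, not the repo's actual API):

```python
import torch

class PromptEncoderWrapper(torch.nn.Module):
    # Hypothetical helper: compute the prompt encoding in one place and
    # return it already transposed to (batch, time, channels), so the
    # call sites don't each need their own .transpose().
    def __init__(self, prompt_encoder):
        super().__init__()
        self.prompt_encoder = prompt_encoder

    def compute_prompt(self, speech_prompts):
        enc = self.prompt_encoder(speech_prompts)  # (B, C, T)
        return enc.transpose(1, 2)                 # (B, T, C)

# Usage with a stand-in encoder
wrapper = PromptEncoderWrapper(torch.nn.Conv1d(8, 32, kernel_size=1))
out = wrapper.compute_prompt(torch.randn(2, 8, 16))
```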
@manmay-nakhashi I commented on the things I found.
Update: code cleanup is all that's left; the model implementation is mostly complete. I am training on VCTK to see if I can get some output.
I think it is better to implement synthesize in the model going forward. It'd be more flexible.
This is buggy -> https://github.com/coqui-ai/TTS/blob/755405d5ca5956dc073144c395332d1b24286cca/TTS/tts/models/naturalspeech2.py#LL757C65-L757C69 I think you should be using the aligner attention, not the predicted one, while training.
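The idea behind this review comment is standard teacher forcing in non-autoregressive TTS: at training time, expand the encoder outputs using the aligner's ground-truth attention rather than the duration predictor's output (which is only used at inference). A minimal sketch, with assumed shapes `(B, T_enc, C)` for the encoder output and a hard alignment `(B, T_dec, T_enc)`:

```python
import torch

def expand_with_aligner_attn(encoder_out, attn):
    # encoder_out: (B, T_enc, C); attn: (B, T_dec, T_enc) hard alignment.
    # Each decoder frame copies the encoder frame it is aligned to.
    return torch.bmm(attn, encoder_out)  # (B, T_dec, C)

enc = torch.randn(1, 4, 8)
attn = torch.zeros(1, 6, 4)
# A toy alignment: durations [2, 1, 1, 2] over 4 encoder frames.
attn[0, torch.arange(6), torch.tensor([0, 0, 1, 2, 3, 3])] = 1.0
dec_in = expand_with_aligner_attn(enc, attn)  # (1, 6, 8)
```

With the ground-truth alignment driving the expansion, the duration predictor can be trained against the aligner's durations without its early, noisy predictions corrupting the decoder's training signal.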
d8f26f0 to b6e3d5e
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
Paper --> https://arxiv.org/pdf/2304.09116.pdf