XTTS v2 - please help out a noob #3457
Replies: 4 comments 2 replies
-
The fine-tuning setup is there so you can further refine the base model for your target voice. In some cases the base model doesn't perform at its best in zero-shot inference (voices it hasn't seen during training), so fine-tuning is still necessary for a subset of use cases.
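For reference, zero-shot cloning (the case where fine-tuning is not involved) can be sketched roughly like this with the Coqui `TTS` Python API. This is a minimal sketch, not the only way to do it: the helper name `zero_shot_clone` and the file paths are made up for illustration, and it assumes the `TTS` package is installed and the XTTS v2 checkpoint can be downloaded on first use.

```python
from pathlib import Path

# Model identifier for XTTS v2 in the Coqui TTS model zoo.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def zero_shot_clone(text, reference_wav, out_path, language="en"):
    """Synthesize `text` in the voice of `reference_wav` using the base model.

    Hypothetical helper: no fine-tuning happens here, the reference clip is
    only used to condition the base model at inference time.
    """
    ref = Path(reference_wav)
    if not ref.is_file():
        raise FileNotFoundError(f"reference clip not found: {ref}")

    # Imported lazily so the check above works even without the package.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(ref),
        language=language,
        file_path=str(out_path),
    )
    return Path(out_path)
```

Something like `zero_shot_clone("Hello there.", "my_voice.wav", "out.wav")` would then write the cloned speech to `out.wav`. If the result is close but not quite right for your voice, that is the situation where fine-tuning may help.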
-
Hi @andykaufseo, if you've found answers to these, please update them here. I'd like to hear about your experience.
-
I tried the XTTS base model with my own voice. It got quite close, but the accent was very different. Will fine-tuning get my voice and accent correct, or at least close? And for fine-tuning, do I need samples of only my voice, or of different voices with the same accent?
-
I have a long list of questions. I recently started using Coqui (XTTS v2 with a cloned voice). I have some experience with LLMs and more experience with Stable Diffusion, but I'm just getting started with TTS (and STT). Here are my questions:
I tested XTTS v2 with a cloned voice and it sounds great (a bit slow, but no biggie). If it already sounds this good, why is there a fine-tuning setup for it? Do I fine-tune the base model on the cloned voice so it sounds even more like that voice? If so, what happens if I clone another voice while using the model fine-tuned on the first one?
As of today, which is the best model for cloning voices? XTTS v2 sounds great. I remember trying Bark a few months ago when it came out and it was a mess, and so was VITS (the code was messed up, I couldn't get it to work, and the demo was bad).
With cloned voices, XTTS v2 is a bit slow (not a pain in the ass, but it could be faster). I've seen projects that claim almost real-time voice generation (https://github.com/KoljaB/RealtimeTTS). Is this possible with XTTS v2?
I remember Bark could insert things like laughter, music, etc. Could I use that in combination with XTTS v2?
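On the speed question, one way to put a number on "a bit slow" versus "almost real time" is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. Below 1.0 means audio is generated faster than it plays back. The sketch below is only an illustration: the function names are made up, `synthesize` stands in for whatever TTS call you are timing, and the 24 kHz default sample rate is an assumption that should be checked against your model's actual output.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """Return synthesis time divided by audio duration (lower is faster).

    sample_rate defaults to 24 kHz, which is an assumption here.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def time_synthesis(synthesize, text, sample_rate=24000):
    """Time one call to `synthesize(text)` and return its real-time factor.

    `synthesize` is a placeholder for any callable that returns a sequence
    of audio samples for the given text.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)
```

For example, 2 seconds of compute for 4 seconds of audio gives an RTF of 0.5. Note that streaming setups care more about the delay before the first audio chunk than about the RTF of a whole sentence.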