Other languages? #71
Comments
I think the easiest approach would be to just change the model (line 267). I default to: Let me know if you are up for trying this yourself, or if you'd like me to help out a little. |
Thank you. I'm getting this error using "tts_models/es/css10/vits" or "tts_models/es/mai/tacotron2-DDC", the only two models that tts lists for Spanish.
Then I went to the indicated line 347 and removed the speaker parameter, and it worked (more or less; I got some jumps and things like that). Now, could you please help me use XTTS? Where do I set the language to Spanish (es), and where do I get the sample.wav for my language? As you can see, I don't really know what I'm doing, but I'm trying to! EDIT: changed in line 344 |
I just merged an improvement that will make things easier. For instance to use the vits Spanish model, you would call it this way: To use XTTS with Spanish, nothing new/different is required. You just need to supply a voice sample wav file (30 seconds or less) that you would like XTTS to clone. For instance, I tested with a bit of Spanish text and it sounded good: To get a sample for cloning, if you use Chrome this is by far the easiest approach. First install this plugin: Then find an audio book sample on Amazon that you really like. Listen to the sample and capture 10-30 seconds of the person speaking. Then convert that to a wav file ("ffmpeg -i voicemod-file.mp3 sample.wav") and you should be good to go! Let me know if this works out for you, or if you have any other questions! |
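For anyone who prefers to script the conversion step, here is a minimal Python sketch of the same ffmpeg call. The helper names are hypothetical, and the `-t 30` option (ffmpeg's duration limit) is an addition that is not in the command quoted above; it is only there to keep the sample within the 30 seconds mentioned. It assumes ffmpeg is installed and on your PATH.

```python
import subprocess


def build_ffmpeg_command(source, target="sample.wav", max_seconds=30):
    """Build the ffmpeg argument list for converting a capture to wav.

    Hypothetical helper: "-t" caps the output duration so the sample
    stays at or under max_seconds.
    """
    return ["ffmpeg", "-i", source, "-t", str(max_seconds), target]


def convert_sample(source, target="sample.wav", max_seconds=30):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(source, target, max_seconds), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_sample() to actually run it.
    print(" ".join(build_ffmpeg_command("voicemod-file.mp3")))
```

Calling `convert_sample("voicemod-file.mp3")` should then leave a `sample.wav` in the current directory.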
OH I forgot to mention regarding XTTS - a GPU is required. It might run with just CPU but I don't think it will, and even if it did I suspect it would be so slow as to be unusable. |
Thank you very much!!!! |
Hello again. I've searched the code and changed the occurrences of What can I do to fix the model to Spanish? The output of the command:
|
I'm not sure there's anything to be done, but I'll ask on the TTS discord. From what I've seen discussed by other people, the XTTS v2 model is multilingual and should pick up speaking characteristics from the samples provided. The changes you made would likely have no impact on the TTS parts since that's just used to try to detect how to segment a given set of text into discrete sentences. It could help it segment sentences better though at least. |
You could try this, from someone on Coqui TTS Discord who said they were not getting expected results in Spanish. pip install --upgrade TTS
If you do this, you can try putting that downloaded model in a new folder, comment out line 56, and add a line: |
I am quite interested in this. Please if you finally find out how to make it work for Spanish I would appreciate if you can summarize all the changes and steps required |
So I moved model to
The text I'm using
No matter which voice I use. One example. I'm running this in an LXD (LXC) container. Tell me if I can help to get this working, please. EDIT1: I've changed line 178 from "en" to "es" and got a good result! Now it sounds Spanish. EDIT2: I think most orthographic or spelling exceptions are not recognized. Changing line 56 back (using the 2.0 model) gives the same result. For example, "j" (which in Spanish sounds like a hard "h", as in "he" or "have") is read like an English "j", etc. EDIT3: In the demo space for XTTS, https://huggingface.co/spaces/coqui/xtts, the "ñ" and the "ü" in words like "pequeña, cigüeña, niño, avergüenza, lingüistico, ordeña, jota, ajeno, jaleo" are pronounced well, so there must be some parameter to fix it? Using the default English voice, I got this perfect pronunciation example: output.mp4 |
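For readers following along, the "en" to "es" change in EDIT1 corresponds to the language argument that Coqui's Python API takes for XTTS. The sketch below is not the epub2tts code. It is a minimal illustration that assumes the `TTS.api.TTS` class and the `tts_models/multilingual/multi-dataset/xtts_v2` model name as shown in Coqui's README; the helper names and file paths are hypothetical. Synthesis needs the TTS package installed and, per the comments above, realistically a GPU.

```python
def build_xtts_request(text, speaker_wav, language="es", out_path="output.wav"):
    """Collect the keyword arguments for an XTTS synthesis call.

    Hypothetical helper: "language" is the code that was hard-coded
    to "en" in the thread; "speaker_wav" is the voice sample to clone.
    """
    return {
        "text": text,
        "speaker_wav": speaker_wav,
        "language": language,
        "file_path": out_path,
    }


def synthesize(request, use_gpu=True):
    """Run XTTS v2 on a request built by build_xtts_request()."""
    # Imported lazily so the helper above works without TTS installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    if use_gpu:
        tts = tts.to("cuda")
    tts.tts_to_file(**request)


if __name__ == "__main__":
    req = build_xtts_request("La cigüeña pequeña", "sample.wav")
    print(req)  # pass req to synthesize() to produce the audio file
```

Keeping the language in one place like this makes it easy to try "it" or another code without hunting for hard-coded values.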
This PR on Coqui-TTS will be merged soon, which will include language and speaker samples from what used to be their paid studio product (and those voices sounded great!) As soon as that merges, I'll add a --language option that should be a big help for this. |
Not sure how closely you're watching the coqui repo, but just to let you know that it looks like that pull request was merged a short time ago and is available in v0.22.0. |
Hello again. Thank you for your time and patience.
And it worked fine! I can see that read_chunk_xtts contains a call to model.inference_stream. And this is the most I can understand. I can't see why this is not working for other languages. Maybe they have better language detection or management in the model.synthesize function or the like. I opened a discussion in Coqui TTS coqui-ai/TTS#3426. |
I'm no expert, although there may be some changes that aedocw is planning on making to support a --language flag as well as the newly merged studio speakers, so the --language flag will probably help quite a bit with this. |
@aedocw The last update fixed all the annoyances. Thank you very much! |
Any help setting the language to italian? I've tried setting |
aedocw will probably chime in here, although just to say that I don't believe the language flag has been implemented in epub2tts yet, although I believe it is on the roadmap going by a previous comment that was made. |
Oh, I've read about it on the |
Seems this commit slipped past my reading, and it looks as though indeed this flag has been implemented. |
I think I can get something up today to use the language flag with xtts language_idx. It just gets a little bit messy since there are two places within coqui-TTS to specify language - it comes back to whether or not you're using XTTS basically. I'll be able to poke at this late this afternoon my time (I'm in pacific TZ) :) |
I don't want to take time away from you if it is something you do not intend to implement in the near future. I'm fine with a temporary solution. Even better if said solution makes me think a bit about the code since I'm in the process of learning Python :) PS: I was not using XTTS (just launching |
No not at all, this is work I definitely planned to implement, I'm glad you are asking about it! XTTSv2 has MUCH more human sounding voices. I added one sample to the repo (sample-shadow-coquiXTTS.m4b), but I really need to put more samples in a permanently accessible place. For instance here are samples of all Coqui's studio voices using XTTS: https://drive.google.com/drive/folders/1roXMrd7peX-zApvyogqfsjyi9nPNKNF_?usp=sharing XTTS is a model that allows you to relatively easily clone voices. With some fine tuning, you can take 8-10 minutes of a recorded voice and get a speech model that sounds VERY VERY much like that person. A really important thing to know though is that you need a GPU to use XTTS voices. If you don't have a GPU, technically I think it's possible but it's so slow as to be useless. (The work happening with StyleTTS2 is probably going to allow for very life-like voices without requiring a GPU, but that is probably a few months away from being ready to use here.) |
So, if I understand correctly, if I get 3 audio samples of around 30 seconds each with a voice of my choice, then pass them to Update: I've tried the previous command and it works fine, the audio is super clear. The only downside is that the voice keeps spelling out "dot" quite often. Is there a way to correct that? Edit 2: Supposing the script is using the GPU, is it normal for the conversion to take around 15-16 hours for an 800-ish page book? |
Yes, it will work like that for reals! A 3070 should be just fine (as it seems you've discovered). Regarding the voice spelling out "dot", can you paste a sample of text that leads to that? I'm always finding characters or things that confuse the synthesizer, and I add them to a section that replaces that text with a comma (or deletes it entirely). Using the GPU, I think 15-16 hours for an 800 page book is probably correct. For me I think it's reading around real-time, so a book that is 8 hours spoken takes around 8 hours. |
Ok perfect. Thanks a lot for the clarifications.
Both dots are spelled out. Edit: I've modified the epub changing the quotation brackets in the usual |
Hmm, this could get complicated with other languages. I had some very harsh text reformatting going on that would try to match everything to unicode, and remove pretty much any special character. With that code, here's what would happen: This basically assumes you're trying to speak english, so gridò and perché would probably be mispronounced. At the very least characters like "«" and "»" can safely be removed since the text-to-speech doesn't do anything special with a phrase in quotes (as far as I can tell). It's also possible that the period in your text is in a different character set so it's confusing the TTS. I'll see if I can confirm that, and maybe at least translate the period. |
Actually, both gridò and perché are not mispronounced, so that's not a problem. If you're able to add the quotation brackets to the reformatting method, that would be super. As for the period, I tried changing every period to the '.' on my keyboard, but the result does not change. I don't know how, but if you can solve that then the TTS would be actually perfect. Edit: I forgot to mention, not all periods are spelled out, but I can't find anything in common between the spelled periods and the non-spelled ones. Edit 2: After some tests, I've noticed something that may be useful: when the period is spelled the result can be either:
Edit 3: Last update for today: after listening to the audio a million times, it sounds like the voice is sometimes saying "punto tondo", which means "round point"/"round dot" in Italian, which may mean the actual symbol is not being recognized as a proper period. The weird sound reported in Edit 2 is still there in some other circumstances. Furthermore, I noticed that changing the samples changes the result, meaning that some periods that were spelled are treated normally after changing samples, but at the same time other periods get spelled. Is there any advice about how the samples should be chosen? |
The branch "more-text-cleanup" has what seems to work for me. It does seem like the text-to-speech wants to pronounce periods as punto. I tried replacing it with unicode character for full-stop (chr(0x002E)) but that had no impact. Replacing periods with commas seems to work though, and still causes TTS to take a beat before speaking the next part (vs. just removing the period which then makes all sentences run together). If you could try this branch and let me know how it sounds to you, I would appreciate it. Thanks! |
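To make the idea concrete for anyone reading later, here is a minimal Python sketch of that kind of cleanup. It is not the code from the "more-text-cleanup" branch; the function name is hypothetical and it only illustrates the two steps discussed in this thread: dropping the « » quotation brackets and swapping periods for commas, while leaving accented letters such as ò and é untouched.

```python
import re


def clean_for_tts(text: str) -> str:
    """Illustrative cleanup, assuming Italian-style text with guillemets."""
    # Remove the quotation brackets; the TTS does nothing special with them.
    text = text.translate({ord("«"): None, ord("»"): None})
    # Replace every period with a comma so the voice pauses instead of
    # reading the symbol aloud. Note this also hits decimals and ellipses.
    text = text.replace(".", ",")
    # Collapse any runs of whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_for_tts("«Perché?» gridò. Poi tacque."))
```

A blanket replacement like this is a blunt tool, so a real implementation would probably want to skip numbers and abbreviations.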
Oh BTW I tested with the following command: |
I'll try it tomorrow morning as soon as I have some spare time and update here. In the meanwhile thank you man, really appreciate the work. Update: Sorry for the late update, but I did some testing following your directions.
This said, even just addressing the first 2 points makes the audio perfectly comprehensible with some minor annoyance tbh. Let me know if you need something else. Edit: One note: I did the substitution ( |
Hello. Just discovered this.
Is there a way to set language? Maybe changing speaker?
I'd like to read Spanish epub.