How to make a fast and "best" TTS system with Coqui TTS? #961
-
I have trained many models with Coqui TTS on a private dataset and, in the end, kept Tacotron2-DDC for my production deployment. It is not too slow: by limiting the reduction factor to 4 I keep RTF < 0.03 on an Nvidia T4 GPU, and with MultiBand-MelGAN plus post-processing (waveform to MP3) the overall RTF stays < 0.05. For now, that speed with good quality is fine for my use case. However, with increasing concurrent requests, I need to lower the RTF further without a noticeable loss in synthesis quality.

I have tried FastPitch. It is fast, especially for long sentences, but too sensitive to dataset quality and distribution. With my long-sentence-dominated dataset, FastPitch turns out to be very bad at synthesizing short sentences, while being almost as good as Tacotron2-DDC on long sentences (Tacotron2-DDC is good on almost everything). To work around this, I used my Tacotron2-DDC model to synthesize many short sentences and used them to fine-tune the FastPitch model. This helped somewhat, but the result is still well behind Tacotron. My questions are:
Thanks! I will miss the meeting because I cannot access the meeting platform.
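
For reference, here is a minimal sketch of how the RTF numbers above could be measured with the Coqui TTS Python API (`TTS.api.TTS`). The public LJSpeech Tacotron2-DDC release stands in for the private model, and the default vocoder paired with it is downloaded automatically (with the `tts` CLI a specific vocoder such as MultiBand-MelGAN can be picked via `--vocoder_name`). The `output_sample_rate` attribute on the underlying synthesizer is an assumption about the API surface, not something from the original post.

```python
# Hedged sketch: rough RTF measurement for a Tacotron2-DDC pipeline.
# The model name is the public LJSpeech release, used only as a stand-in
# for the private production model described above.
import time

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=True)

text = "A medium length sentence to get a rough real time factor estimate."

start = time.perf_counter()
wav = tts.tts(text)  # list of float samples
elapsed = time.perf_counter() - start

# Assumption: the wrapped Synthesizer exposes its output sample rate here.
sample_rate = tts.synthesizer.output_sample_rate
audio_seconds = len(wav) / sample_rate
print(f"RTF = {elapsed / audio_seconds:.3f} "
      f"({elapsed:.2f}s to generate {audio_seconds:.2f}s of audio)")
```

In practice it makes sense to average over many sentences of mixed lengths, since RTF for attention-based models like Tacotron2-DDC varies strongly with utterance length.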
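
And here is a hedged sketch of the data-augmentation step described above: using the existing Tacotron2-DDC model to synthesize short sentences and writing them out in an LJSpeech-style layout (`wavs/` plus `metadata.csv`) so they can be mixed into the FastPitch fine-tuning set. The paths, sentences, and the public model name are placeholders, not the private setup from the post.

```python
# Hedged sketch: generate synthetic short-sentence audio for FastPitch fine-tuning.
from pathlib import Path

from TTS.api import TTS

short_sentences = [
    "Yes.",
    "Thank you very much.",
    "Please wait a moment.",
    "See you tomorrow.",
]

out_dir = Path("fastpitch_aug")  # hypothetical output folder
(out_dir / "wavs").mkdir(parents=True, exist_ok=True)

# Public model name as a stand-in for the private Tacotron2-DDC checkpoint.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=True)

rows = []
for i, sentence in enumerate(short_sentences):
    file_id = f"aug_{i:05d}"
    tts.tts_to_file(text=sentence, file_path=str(out_dir / "wavs" / f"{file_id}.wav"))
    # LJSpeech metadata format: id|raw text|normalized text
    rows.append(f"{file_id}|{sentence}|{sentence}")

(out_dir / "metadata.csv").write_text("\n".join(rows), encoding="utf-8")
```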
-
Hello, do I understand correctly that you decided not to use the VITS VCTK model because of its high GPU memory usage? Wouldn't a multi-speaker model result in higher quality of the generated speech? I am currently training on the male subset of the VCTK dataset to see how different it would be from the pre-trained model.
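
For anyone wanting to compare, a rough sketch of trying the released multi-speaker VITS VCTK model is below. The speaker ID is only an example; the `speakers` property is assumed to list the available VCTK speaker names.

```python
# Hedged sketch: synthesize with the released multi-speaker VITS VCTK model.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/vctk/vits", gpu=True)

# Assumption: the multi-speaker model exposes its speaker names here.
print(tts.speakers[:10])

tts.tts_to_file(
    text="How different does a multi-speaker model sound for this voice?",
    speaker="p226",  # example VCTK speaker ID
    file_path="vctk_vits_p226.wav",
)
```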
-
(answered in the call)