How to make a fast and "best" TTS system with Coqui TTS? #961
-
I have trained many models with Coqui TTS on a private dataset and, in the end, kept Tacotron2-DDC for my production deployment. It is not too slow: by limiting the reduction factor to 4 I keep RTF < 0.03 on an Nvidia T4 GPU, and with MultiBand-MelGAN plus post-processing (waveform to MP3) the overall RTF stays < 0.05. For now, that speed with good quality is fine for my use case. However, with increasing concurrent requests, I need to lower the RTF further without a noticeable loss in synthesis quality.

I have tried FastPitch. It is fast, especially for long sentences, but too sensitive to dataset quality and distribution. With my long-sentence-dominated dataset, FastPitch turns out to be very bad at synthesizing short sentences, while being almost as good as Tacotron2-DDC on long sentences (Tacotron2-DDC is good on almost everything). To work around this, I used my Tacotron2-DDC model to synthesize many short sentences and used them to fine-tune the FastPitch model. This helped somewhat, but the result is still well behind Tacotron. My questions are:
Thanks! I will miss the meeting because I cannot access the meeting platform.
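
For reference, here is a minimal sketch of how the RTF numbers above could be measured with the Coqui TTS Python API (`TTS.api.TTS`). The public LJSpeech Tacotron2-DDC release stands in for the private model, and the default vocoder paired with it is downloaded automatically (with the `tts` CLI a specific vocoder such as MultiBand-MelGAN can be picked via `--vocoder_name`). The `output_sample_rate` attribute on the underlying synthesizer is an assumption about the API surface, not something from the original post.

```python
# Hedged sketch: rough RTF measurement for a Tacotron2-DDC pipeline.
# The model name is the public LJSpeech release, used only as a stand-in
# for the private production model described above.
import time

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=True)

text = "A medium length sentence to get a rough real time factor estimate."

start = time.perf_counter()
wav = tts.tts(text)  # list of float samples
elapsed = time.perf_counter() - start

# Assumption: the wrapped Synthesizer exposes its output sample rate here.
sample_rate = tts.synthesizer.output_sample_rate
audio_seconds = len(wav) / sample_rate
print(f"RTF = {elapsed / audio_seconds:.3f} "
      f"({elapsed:.2f}s to generate {audio_seconds:.2f}s of audio)")
```

In practice it makes sense to average over many sentences of mixed lengths, since RTF for attention-based models like Tacotron2-DDC varies strongly with utterance length.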
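
And here is a hedged sketch of the data-augmentation step described above: using the existing Tacotron2-DDC model to synthesize short sentences and writing them out in an LJSpeech-style layout (`wavs/` plus `metadata.csv`) so they can be mixed into the FastPitch fine-tuning set. The paths, sentences, and the public model name are placeholders, not the private setup from the post.

```python
# Hedged sketch: generate synthetic short-sentence audio for FastPitch fine-tuning.
from pathlib import Path

from TTS.api import TTS

short_sentences = [
    "Yes.",
    "Thank you very much.",
    "Please wait a moment.",
    "See you tomorrow.",
]

out_dir = Path("fastpitch_aug")  # hypothetical output folder
(out_dir / "wavs").mkdir(parents=True, exist_ok=True)

# Public model name as a stand-in for the private Tacotron2-DDC checkpoint.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=True)

rows = []
for i, sentence in enumerate(short_sentences):
    file_id = f"aug_{i:05d}"
    tts.tts_to_file(text=sentence, file_path=str(out_dir / "wavs" / f"{file_id}.wav"))
    # LJSpeech metadata format: id|raw text|normalized text
    rows.append(f"{file_id}|{sentence}|{sentence}")

(out_dir / "metadata.csv").write_text("\n".join(rows), encoding="utf-8")
```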
-
Hello, do I understand correctly that you decided not to use the VITS VCTK model because of its high GPU memory usage? Wouldn't a multi-speaker model result in higher quality of the generated speech? I am currently training on the male subset of the VCTK dataset to see how different it would be from the pre-trained model.
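
For anyone wanting to compare, a rough sketch of trying the released multi-speaker VITS VCTK model is below. The speaker ID is only an example; the `speakers` property is assumed to list the available VCTK speaker names.

```python
# Hedged sketch: synthesize with the released multi-speaker VITS VCTK model.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/vctk/vits", gpu=True)

# Assumption: the multi-speaker model exposes its speaker names here.
print(tts.speakers[:10])

tts.tts_to_file(
    text="How different does a multi-speaker model sound for this voice?",
    speaker="p226",  # example VCTK speaker ID
    file_path="vctk_vits_p226.wav",
)
```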
-
(answered in the call)