"Streaming" TTS results on the fly #592
-
Yes, it could be done: you could implement streaming TTS and reduce the first-packet delay to less than 250 ms. I wonder why no one discusses this problem in this project...
-
Streaming TTS certainly has its use cases, but it is not imminent in our current plans. That said, if someone is interested, I'd be happy to help. Also, synthesize.py and the demo server are mostly for trying out models and are not intended for any real-life deployment for now.
-
Not sure if this progressed, but I'd be interested to know if it got any further.
-
Was anyone able to do this? It would be awesome if you could share!
-
Hi all. I should preface this by saying I'm not at all knowledgeable about text-to-speech or machine learning; I have zero understanding of the papers and code behind all this, so most of what I say will probably be either obvious or total nonsense to all of you. I'm just a hobbyist who likes to play around with TTS and vocal synthesis software. Also, apologies if this has already been brought up; I couldn't find any similar discussions.
I was wondering if there'd be any feasible way to reduce the apparent time it takes to get a result by generating speech in blocks and playing them back to the user as they're created. As it stands, even if my computer can generate speech at 2x "real time", I still have to wait, e.g., 1 minute for a 2-minute-long block of text. Rendering on the fly would nearly eliminate that delay.
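To put rough numbers on that (everything below is an illustrative assumption, not a measurement of any particular model):

```python
# Toy latency model. RTF (real-time factor) is seconds of compute per
# second of audio; 2x real-time generation means RTF = 0.5.
RTF = 0.5
AUDIO_LEN = 120.0   # a 2-minute block of text, rendered as speech
FIRST_SENT = 5.0    # assumed duration of the first sentence's audio

batch_wait = AUDIO_LEN * RTF    # 60.0 s before playback can start
stream_wait = FIRST_SENT * RTF  # 2.5 s before playback can start

# Streaming playback never stalls as long as each chunk renders faster
# than it plays back, i.e. as long as RTF < 1.0.
print(f"batch wait: {batch_wait:.1f}s, streaming wait: {stream_wait:.1f}s")
```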
From my look at the code, it seems the networks process a whole sentence's worth of input at a time. Would it be feasible to make them process shorter chunks of input instead? (I'm assuming the answer is no, at least with current TTS models, but I figured I'd ask anyway.)
As it stands, the frontend (TTS/utils/synthesizer.py) already splits input into sentences. I imagine one could, instead of looping through every sentence before saving the complete speech as a .wav, queue each sentence for playback immediately after it's been generated. You'd have to account for timing, of course: the synthesizer would need to know an average render time per word/phoneme, watch out for situations like ['Hi!', 'This is a really long sentence. blah blah blah yadayada...'], and withhold the first sentence long enough to prevent gaps in playback. Would such a thing be feasible?
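To make that concrete, here's an untested sketch of what I mean. The split_into_sentences helper, the tts() call, and output_sample_rate are what I think synthesizer.py exposes; the sounddevice playback, the bounded queue, and the threading are entirely my own guesses:

```python
import queue
import threading

import numpy as np
import sounddevice as sd  # assumed playback backend, not part of this repo

from TTS.utils.synthesizer import Synthesizer


def stream_tts(synthesizer: Synthesizer, text: str) -> None:
    # Bounded queue: the producer stays at most a few sentences ahead,
    # which caps memory and acts as the "withholding" buffer.
    chunks: "queue.Queue[np.ndarray | None]" = queue.Queue(maxsize=4)

    def produce() -> None:
        # Reuse the frontend's own sentence splitting, then synthesize
        # one sentence at a time and hand each result to the player.
        for sentence in synthesizer.split_into_sentences(text):
            wav = np.asarray(synthesizer.tts(sentence), dtype=np.float32)
            chunks.put(wav)
        chunks.put(None)  # sentinel: no more audio coming

    threading.Thread(target=produce, daemon=True).start()

    while True:
        wav = chunks.get()
        if wav is None:
            break
        # Blocking playback keeps chunks in order; meanwhile the producer
        # thread is already rendering the next sentence.
        sd.play(wav, samplerate=synthesizer.output_sample_rate)
        sd.wait()
```

If rendering is faster than real time, the queue fills up while each sentence plays, so after the first sentence the playback should (in theory) never starve.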
I'm not trying to beg anyone to add a feature for me; I'm just looking to satisfy my curiosity (and maybe try to work on it myself). Thanks to everyone who's worked on this project!