"Streaming" TTS results on the fly #592
-
Yes, it could be done: you could implement streaming TTS and reduce the first-packet delay to less than 250 ms. I wonder why no one discusses this problem in this project...
-
Streaming TTS certainly has its use cases, but it is not imminent in our current plans. That said, if someone is interested, I'd be happy to help. Also, synthesize.py and the demo server are mostly for trying out models and are not intended for any real-life deployment for now.
-
Not sure if this progressed, but I'd be interested to know if it got any further.
-
Was anyone able to do this? It would be awesome if you could share!
-
Hi all. I should preface this by saying I'm not at all knowledgeable about text-to-speech or machine learning; I have zero understanding of the papers and code behind all this, so most of what I say will probably be either obvious or total nonsense to all of you. I'm just a hobbyist who likes to play around with TTS and vocal synthesis software. Also, apologies if this has already been brought up; I couldn't find any similar discussions.
I was wondering if there'd be any feasible way to reduce the apparent time it takes to get a result by generating speech in blocks and playing them back to the user as they're created. As it stands, even if my computer can generate speech at 2x "real time", I still have to wait, e.g., 1 minute for a 2-minute-long block of text. Rendering on the fly would nearly eliminate that delay.
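To put rough numbers on that (everything below is an illustrative assumption, not a measurement of any particular model):

```python
# Toy latency model. RTF (real-time factor) is seconds of compute per
# second of audio; 2x real-time generation means RTF = 0.5.
RTF = 0.5
AUDIO_LEN = 120.0   # a 2-minute block of text, rendered as speech
FIRST_SENT = 5.0    # assumed duration of the first sentence's audio

batch_wait = AUDIO_LEN * RTF    # 60.0 s before playback can start
stream_wait = FIRST_SENT * RTF  # 2.5 s before playback can start

# Streaming playback never stalls as long as each chunk renders faster
# than it plays back, i.e. as long as RTF < 1.0.
print(f"batch wait: {batch_wait:.1f}s, streaming wait: {stream_wait:.1f}s")
```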
From my look at the code, it seems the networks process a whole sentence's worth of input at a time. Would it be feasible to make them process shorter chunks of input instead? (I'm assuming the answer is no, at least with current TTS models, but I figured I'd ask anyway.)
As it stands, the frontend (TTS/utils/synthesizer.py) already splits input into sentences. I imagine one could, instead of looping through every sentence before saving the complete speech as a .wav, queue each sentence for playback immediately after it's been generated. You'd have to account for timing, of course: the synthesizer would need to know an average render time per word/phoneme, watch out for situations like ['Hi!', 'This is a really long sentence. blah blah blah yadayada...'], and withhold the first sentence long enough to prevent gaps in playback. Would such a thing be feasible?
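To make that concrete, here's an untested sketch of what I mean. The split_into_sentences helper, the tts() call, and output_sample_rate are what I think synthesizer.py exposes; the sounddevice playback, the bounded queue, and the threading are entirely my own guesses:

```python
import queue
import threading

import numpy as np
import sounddevice as sd  # assumed playback backend, not part of this repo

from TTS.utils.synthesizer import Synthesizer


def stream_tts(synthesizer: Synthesizer, text: str) -> None:
    # Bounded queue: the producer stays at most a few sentences ahead,
    # which caps memory and acts as the "withholding" buffer.
    chunks: "queue.Queue[np.ndarray | None]" = queue.Queue(maxsize=4)

    def produce() -> None:
        # Reuse the frontend's own sentence splitting, then synthesize
        # one sentence at a time and hand each result to the player.
        for sentence in synthesizer.split_into_sentences(text):
            wav = np.asarray(synthesizer.tts(sentence), dtype=np.float32)
            chunks.put(wav)
        chunks.put(None)  # sentinel: no more audio coming

    threading.Thread(target=produce, daemon=True).start()

    while True:
        wav = chunks.get()
        if wav is None:
            break
        # Blocking playback keeps chunks in order; meanwhile the producer
        # thread is already rendering the next sentence.
        sd.play(wav, samplerate=synthesizer.output_sample_rate)
        sd.wait()
```

If rendering is faster than real time, the queue fills up while each sentence plays, so after the first sentence the playback should (in theory) never starve.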
I'm not trying to beg anyone to add a feature for me; I'm just looking to satisfy my curiosity (and maybe try to work on it myself). Thanks to everyone who's worked on this project!