Feature Request: use dtw-ra with provided wordTimeline (#60)
Thanks for the suggestion. It seems possible to allow dtw-ra to accept a precomputed recognition timeline. I understand that if you're already running speech recognition on the input anyway, then being able to reuse the recognition results in DTW-RA would save a significant amount of computation. You can also try using the whisper engine locally, which already produces word-level timestamps as part of recognition.
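For example, something roughly like this (a sketch only; the exact option and result property names may differ from the current API):

```ts
// Rough sketch: run local recognition with the whisper engine, which
// returns the transcript together with word-level timestamps in one pass.
// Option and result property names are approximate.
import * as Echogarden from 'echogarden'

const recognition = await Echogarden.recognize('episode.mp3', {
  engine: 'whisper',
  language: 'en',
})

console.log(recognition.transcript)
console.log(recognition.timeline) // word entries with start/end times
```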
Thanks for your kind reply.
I'll try that. However, my product is a desktop application, and not every user has a high-performance machine (to run whisper locally); those with slower machines might prefer using web APIs like OpenAI or Azure to generate transcripts. So using dtw-ra with a precomputed recognition result would be ideal for that.
Update: I stumbled upon […]. Regardless, I would still like to use dtw-ra with the provided wordTimeline.
Previously, when I started adding speech recognition, I did use a similar approach. I'm not sure I understand exactly how you are using this. I could add this sort of alignment to the […].

Anyway, if what you mainly need is to remove / ignore music, you can also achieve that with much less complexity (at least potentially) by first applying source separation […].

The thing is that source separation with MDX-NET (the only currently supported model architecture) is currently slow, not because the model is slow (with GPU acceleration it's actually pretty fast), but because the FFT (STFT) the model requires has very long windows that are expensive to compute. The current implementation is a kind of prototype, in a sense, and it uses a single-threaded, WebAssembly-based FFT and STFT (KissFFT). I plan, in the future, to transition to an ONNX-based STFT (added in recent versions of the ONNX standard), which would also have native multi-threading. That way, I could also move a lot of the preprocessing and postprocessing into an ONNX model, running on native CPU/GPU, which would be faster.

If you want to see just how fast GPU-accelerated source separation can be, and test the results with some more expensive models, I highly recommend trying: https://github.com/Anjok07/ultimatevocalremovergui

There are also command-line based utilities that use the same models, like […]
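In other words, the separate-then-align pipeline could look roughly like this (a sketch only; isolateVocals is a placeholder for whatever source-separation entry point ends up being used, and the align options and result properties are approximate):

```ts
// Sketch of the separate-then-align idea. `isolateVocals` is a hypothetical
// placeholder, not an actual Echogarden API name.
import * as Echogarden from 'echogarden'

async function isolateVocals(inputPath: string): Promise<string> {
  // ... run MDX-NET (or similar) source separation, write the isolated
  // vocals to a file, and return its path ...
  return 'vocals-only.wav'
}

const transcriptText = 'Full transcript of the spoken content ...'

// Align against the cleaned (vocals-only) audio. Since separation preserves
// the time axis, the resulting timeline also applies to the original,
// unseparated recording.
const vocalsPath = await isolateVocals('episode-with-music.mp3')
const result = await Echogarden.align(vocalsPath, transcriptText, { language: 'en' })

console.log(result.timeline)
```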
I'm using Echogarden to build a language learning tool, so I want not just a word-level timeline, but phoneme-level timestamps as well.

Previously, I used whisper.cpp/OpenAI/etc. to get the transcript text and aligned the whole text with the audio using Echogarden. However, background noise and music made the alignments inaccurate in those sections.

Now, I use whisper.cpp/OpenAI/etc. or even […] to generate the transcript together with word-level timestamps.

Maybe this snapshot of my project can help explain what I need.
I also tried the […].
If the text is generated via speech recognition, then the recognizer would produce word timestamps (no need for alignment). So if you want phone-level timestamps, there are several ways to do it:

Extracting phone-level timestamps from each recognized word boundary

This would work if the word boundaries output by the recognizer are accurate. In the case of the […]. Your suggestion of passing the recognized timeline to dtw-ra relates to this approach.

Take native segment output from Whisper and apply alignment to each individually

The native output of the (original, unmodified) Whisper model (both […]) is a series of segments, each with its own start and end timestamps. I can take each one of these and individually apply synthesis-based DTW alignment to them. That is a bit like what you're trying to do with your current procedure.

As I mentioned, this was considered the "base" (simple) approach. I had it implemented in previous versions but removed it later. I can put it back in.

I explained why the source separation is currently slow. MDX-NET is originally distributed as an ONNX model. The version I use is the same one that is used by python-audio-separator; it's the exact same […]. The slowness is not because of the MDX-NET source separation model itself, but because the input to the model requires extracting FFTs (STFT and ISTFT) on large windows with many overlaps, which, with the extra pre- and post-processing needed, is currently more expensive than running the inference itself.

Also, I'm not sure what the crashes you are describing with long inputs are about. If you're using a […].

Anyway, I gave links to two implementations that use more optimized FFT operations (I believe based on PyTorch FFT, not ONNX). On the faster models (like […]).

Another reason MDX-NET is slow is that it is originally designed for extracting vocals (not specifically speech) from music tracks, so quality is very important. Here, we don't really need high-quality separation. We need just enough so that the speech audio features would be similar enough to the reference audio features. It's also possible the high overlap is not needed. For the prototype implementation I decided to keep the same overlap to ensure it produces similar outputs to the reference.
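As a rough illustration of the second approach (this is just a sketch of the idea, not the actual implementation; the types and the alignText callback are made up for the example):

```ts
// Sketch of the "align each recognized segment individually" idea.
// `alignText` stands in for a synthesis-based DTW alignment call
// (e.g. something like Echogarden's align API); signatures are approximate.

interface RecognizedSegment {
  text: string
  startTime: number // seconds, relative to the full recording
  endTime: number
}

interface TimelineEntry {
  type: 'word' | 'phone'
  text: string
  startTime: number
  endTime: number
}

async function alignSegments(
  segments: RecognizedSegment[],
  alignText: (audioStart: number, audioEnd: number, text: string) => Promise<TimelineEntry[]>,
): Promise<TimelineEntry[]> {
  const fullTimeline: TimelineEntry[] = []

  for (const segment of segments) {
    // Align only the audio within this segment's time range.
    const localTimeline = await alignText(segment.startTime, segment.endTime, segment.text)

    // Offset the per-segment timestamps back onto the full recording's time axis.
    for (const entry of localTimeline) {
      fullTimeline.push({
        ...entry,
        startTime: entry.startTime + segment.startTime,
        endTime: entry.endTime + segment.startTime,
      })
    }
  }

  return fullTimeline
}
```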
Thanks for the explanations! But I'm a little confused now. Here are my detailed procedures:

[…]

So how can I improve these procedures? Any suggestions? Thank you.
Basically, what I said is that I intend to add options to handle your needs and use case, since they are generally useful. You seem to mostly just want to have phonemes in your recognition timeline; that's not a problem to add.

Right now the development is simply not as fast as in March-May of this year, when I made a huge number of improvements (at some point releasing a significant new version almost every few days) and spent hundreds of hours on development. That's just the way it is sometimes. This project has never been funded in any way and has been developed on a zero budget. There are more and less active periods. The last few months have been less active, but I still try to handle issues on the tracker.
I completely understand. Please take your time. For now, I'll work on improving my procedures based on the current codebase. Thank you so much.
You mentioned you're possibly using cloud services for transcription, like OpenAI or Azure. The OpenAI Whisper service is $0.006 per minute, which is $0.36 per hour. However, I was looking at LLM API pricing and noticed that Groq's Whisper Large V3 (requires registration) is only $0.03/hour. Groq also publishes these limits for my (unpaid) account (but I'm not sure if they are actually free? If they are, that's a lot): […]

Groq has an OpenAI-compatible transcription endpoint at https://api.groq.com/openai/v1/audio/transcriptions (which is also known to be very fast - there's a highly recommended site that publishes a leaderboard for cloud speech recognition services, where Groq is rated very highly on price/performance). I'm not sure if it provides all the features of OpenAI's servers (say, including the accurate word timestamps); I'll need to test that. I'll add support for custom endpoints for the […] engine.

So as you see, adding optional phoneme timestamps would also need to include the cloud engines, to get more consistent coverage. I would love to just be able to implement all of these features immediately, but of course sometimes even small changes take time and effort. Right now these conversations are mostly helping me to discover new issues and ideas and prioritize. (I do have partially-committed work-in-progress code - especially on adding machine translation integration - so that takes the highest priority right now.)
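For reference, a minimal sketch of calling that endpoint (assuming the standard OpenAI-style multipart request and the whisper-large-v3 model name; whether Groq supports every parameter, such as word-level timestamps, still needs to be verified):

```ts
// Minimal sketch: send an audio file to Groq's OpenAI-compatible
// transcription endpoint. Field names follow the OpenAI audio API;
// which of them Groq fully supports still needs to be verified.
import { readFile } from 'node:fs/promises'

async function transcribeWithGroq(filePath: string, apiKey: string) {
  const formData = new FormData()
  formData.append('file', new Blob([await readFile(filePath)]), 'audio.mp3')
  formData.append('model', 'whisper-large-v3')
  formData.append('response_format', 'verbose_json')

  const response = await fetch('https://api.groq.com/openai/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: formData,
  })

  return await response.json()
}
```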
I'm extensively using the align API in my product (Enjoy App, a language learning tool). Here is the standard procedure:

1. The user imports or records an audio file.
2. A transcript is generated with whisper.cpp / OpenAI / other web APIs, which also produces word-level timestamps.
3. The whole transcript text is aligned with the audio using Echogarden's align API.

Generally, this process works well. However, if the audio contains music or other background noises, the alignments become inaccurate around those sections.

I believe using dtw-ra can resolve this issue. With the dtw-ra option, Echogarden generates a wordTimeline before creating alignments:

echogarden/src/api/Alignment.ts, line 180 in 48baa2f

In my case, the wordTimeline is already generated in Step 2. Therefore, I hope the wordTimeline can be passed as a parameter when using dtw-ra, like this:
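Something along these lines (just an illustration of what I have in mind; the recognitionTimeline option name is made up and doesn't exist in the current API):

```ts
// Illustration only: "recognitionTimeline" is a hypothetical option name,
// not part of the current Echogarden API.
import * as Echogarden from 'echogarden'

const audioFilePath = 'episode.mp3'
const transcriptText = 'Full transcript text ...'

// Word timeline already produced in Step 2 by whisper.cpp / OpenAI / etc.
const precomputedWordTimeline = [
  { type: 'word', text: 'Full', startTime: 0.35, endTime: 0.58 },
  // ...
]

const result = await Echogarden.align(audioFilePath, transcriptText, {
  engine: 'dtw-ra',

  // Reuse the existing recognition results instead of running
  // speech recognition again inside dtw-ra.
  recognitionTimeline: precomputedWordTimeline,
})
```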
I hope this clarifies my request. Thank you.