Feature Request: use dtw-ra with provided wordTimeline #60

Open · an-lee opened this issue Jul 4, 2024 · 11 comments

Labels: alignment (Issue related to forced alignment), feature (Issue proposes a new feature)

an-lee commented Jul 4, 2024

I'm extensively using the align API in my product (Enjoy App, a language learning tool).

Here is the standard procedure:

  1. The user uploads an audio file.
  2. The audio is transcribed using whisper.cpp/OpenAI/Azure to obtain the transcript.
  3. The audio is aligned with the transcript using Echogarden.

Generally, this process works well. However, if the audio contains music or other background noises, the alignments become inaccurate around those sections.

I believe using dtw-ra can resolve this issue.

With the dtw-ra option, Echogarden generates a wordTimeline before creating alignments.

let { wordTimeline: recognitionTimeline } = await API.recognize(sourceRawAudio, recognitionOptions)

In my case, the wordTimeline is already generated in Step 2.

Therefore, I hope the wordTimeline can be passed as a parameter when using dtw-ra, like this:

Echogarden.align(audio, transcript, { engine: 'dtw-ra', wordTimeline: wordTimeline })
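
For illustration, a rough sketch of how I imagine this could work on my side (whisperWords stands in for the word timestamps I already have from Step 2, and the entry shape follows Echogarden's timeline format of { type, text, startTime, endTime } with times in seconds):

import * as Echogarden from 'echogarden'

// Map word timestamps from the external recognizer to Echogarden's timeline entry shape.
const wordTimeline = whisperWords.map(word => ({
  type: 'word',
  text: word.text,
  startTime: word.start,
  endTime: word.end,
}))

// The wordTimeline option here is the one proposed in this issue; it doesn't exist yet.
const result = await Echogarden.align(audio, transcript, { engine: 'dtw-ra', wordTimeline })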

I hope this clarifies my request. Thank you.

rotemdan (Member) commented Jul 4, 2024

Thanks for the suggestion. It seems possible to allow dtw-ra to accept a precomputed recognition timeline. I'll look into that.

I understand that if you're already running speech recognition on the input anyway, then being able to reuse the recognition results in DTW-RA would save a significant amount of computation.

Also, dtw-ra supports any recognition engine, including whisper.cpp, so the recognition stage can be faster than with the built-in Whisper (ONNX-based). Being able to provide a precomputed recognition result would also allow you to apply any sort of processing to the timeline, or use a recognition method that is external to Echogarden (although in that case you'll need to produce the word timeline yourself).

You can also try using the whisper alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.
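
For reference, switching engines is just a change to the options (a minimal sketch):

// Minimal sketch: the same align call, using the Whisper-based forced alignment engine.
const result = await Echogarden.align(audio, transcript, { engine: 'whisper' })

// result.wordTimeline holds the word-level entries; result.timeline holds the full hierarchy.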

an-lee (Author) commented Jul 4, 2024

Thanks for your kind reply.

> You can also try using the whisper alignment engine. It has been redone in the past few months. Due to its use of specialized forced decoding (not conventional recognition), it may be able to produce better results than DTW-RA in some cases.

I'll try that.

However, my product is a desktop application. Not every user has a high-performance machine (to run whisper locally); those with slower machines might prefer using web APIs like OpenAI or Azure to generate transcripts. So using dtw-ra with a precomputed recognition result would be ideal for that.

an-lee (Author) commented Jul 23, 2024

Updates:

I stumbled upon the alignSegments API in the source code. It's a game-changer for my problem. Whisper usually transcribes background music as [music], while the actual speech segments come with accurate startTime and endTime values. This lets me create a segment timeline. Then, using the Echogarden.alignSegments API, I can get pretty good results.
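
Roughly what I'm doing (a simplified sketch; whisperSegments, rawAudio and alignmentOptions are placeholders, and the alignSegments argument order follows my reading of the source, so it may not be exact):

// Build a segment timeline from whisper.cpp's segment timestamps, skipping [music] segments.
const segmentTimeline = whisperSegments
  .filter(segment => !segment.text.includes('[music]'))
  .map(segment => ({
    type: 'segment',
    text: segment.text,
    startTime: segment.start,
    endTime: segment.end,
  }))

// Align each segment's text against its own audio range.
const alignedTimeline = await Echogarden.alignSegments(rawAudio, segmentTimeline, alignmentOptions)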

Regardless, I would still like to use dtw-ra with the provided wordTimeline.

rotemdan (Member) commented Jul 23, 2024

alignSegments is currently only used by the google-translate synthesis engine (and the defunct streamlabs-polly). I used it because I synthesize individual fragments of the text, sized to fit the maximum length that Google Translate allows, and then apply word alignment to each segment individually. This ensures the original (accurate) boundaries between segments are preserved.

In the past, when I started adding speech recognition, I did use alignSegments to add word timestamps to the output segments recognized by Whisper, but I later transitioned to a more sophisticated approach that uses the model's internal cross-attention weights.

I'm not sure I understand exactly how you are using this. [music] tokens (and similar ones like [applause] or [laughter]) are intentionally suppressed by the ONNX-based Whisper engine. I'm assuming you are using the whisper.cpp version. In that case, I'm not sure why you need to apply alignment this way, since whisper.cpp has its own interpolation-based word-level alignment.

I could add this sort of alignment to the whisper.cpp result (take sections and apply synthesis-based DTW to each), if it performs better. I think I may have it on my task list somewhere.


Anyway, if what you mainly need is to remove / ignore music, you can also achieve that with much less complexity (at least potentially) by first applying source separation (--isolate), which is designed to do just that, and then applying standard DTW alignment on the isolated voice.
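
For example (a sketch; this assumes the --isolate flag maps to an isolate option in the API options object, as CLI flags generally do):

// Run voice isolation on the input first, then apply standard DTW alignment.
const result = await Echogarden.align(audio, transcript, {
  engine: 'dtw',
  isolate: true,
})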

The thing is that source separation with MDX-NET (the only currently supported model architecture) is currently slow, not because the model is slow (with GPU acceleration, it's actually pretty fast), but because the FFT (STFT) the model requires has very long windows that are expensive to compute. The current implementation is a kind of prototype, and it uses a single-threaded WebAssembly-based FFT and STFT (KissFFT).

I plan, in the future, to transition to an ONNX-based STFT (added in recent versions of the ONNX standard), which would also get native multi-threading. That way, I could also move a lot of the preprocessing and postprocessing into an ONNX model running on native CPU/GPU, which would be faster.

If you want to see just how fast GPU-accelerated source separation can be, and test the results with some more expensive models, I highly recommend trying:

https://github.com/Anjok07/ultimatevocalremovergui

There are also command-line based utilities that use the same models, like:

https://github.com/nomadkaraoke/python-audio-separator

an-lee (Author) commented Jul 23, 2024

I'm using Echogarden to build a language learning tool. So I don't just want a word-level timeline; a token-level or even phone-level timeline would be better.

Previously, I used whisper.cpp/OpenAI/etc. to get the transcript text and aligned the whole text with the audio using Echogarden. However, background noise and [music] transcripts could make the alignments far from accurate.

Now, I use whisper.cpp/OpenAI/etc. or even .srt and .vtt files to get the accurate boundaries of segments/sentences. Then, the alignSegments API helps me get the accurate token-level and phone-level timelines, which are exactly what I need.

Maybe this screenshot of my project can help explain what I need.

[image: screenshot of the project]

an-lee (Author) commented Jul 23, 2024

I also tried the --isolate option. It gives great results, but it takes too long. Plus, it crashes when dealing with large audio files.

rotemdan (Member) commented Jul 23, 2024

If the text is generated via speech recognition, then the recognizer already produces word timestamps (no need for alignment). So if you want phone-level timestamps, there are several ways to do it:

Extracting phone-level timestamps from each recognized word boundary

This would work if the word boundaries output by the recognizer are accurate. In the case of the whisper.cpp engine, they are mostly not, but in the case of the ONNX-based engine, they may be good enough. I can add that as an option (I think I already have that on my task list actually).
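
A conceptual sketch of what I mean (this is not an existing Echogarden option; it assumes a rawAudio object of the form { audioChannels, sampleRate }, a recognition timeline with word start/end times in seconds, and that word entries in the align result carry a nested phone timeline):

// Align each recognized word against its own audio slice to get phone-level
// timestamps within the word's existing boundaries.
for (const wordEntry of recognitionTimeline) {
  const startSample = Math.floor(wordEntry.startTime * rawAudio.sampleRate)
  const endSample = Math.ceil(wordEntry.endTime * rawAudio.sampleRate)

  const wordAudio = {
    audioChannels: rawAudio.audioChannels.map(channel => channel.subarray(startSample, endSample)),
    sampleRate: rawAudio.sampleRate,
  }

  const { wordTimeline } = await Echogarden.align(wordAudio, wordEntry.text, { engine: 'dtw' })

  // Offset the nested phone entries back to absolute time and attach them to the word.
  wordEntry.timeline = (wordTimeline[0]?.timeline ?? []).map(phone => ({
    ...phone,
    startTime: phone.startTime + wordEntry.startTime,
    endTime: phone.endTime + wordEntry.startTime,
  }))
}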

Your suggestion of passing the recognized timeline to dtw-ra could also achieve this, but I'm not sure what you would use as the reference. If the intention is to pass the recognized text as the reference, then the whole thing seems unnecessary, since dtw-ra was designed for cases where the recognized and reference transcripts are different from each other. If they are the same, it may be easier to process every segment / sentence / word boundary individually to add the more refined token- and phone-level timestamps.

Take native segment output from Whisper and apply alignment to each individually

The native output of the (original, unmodified) Whisper model (both whisper.cpp and ONNX versions) is a sequence of timestamped text segments that look like this:

[image: example of Whisper's native timestamped segment output]

I can take each one of these and individually apply synthesis-based DTW alignment to them. That is a bit like what you're trying to do with alignSegments but could possibly have better results. In any case, it will use alignSegments in a very similar way, but would do it over the time ranges the model natively outputs, not interpolated or already aligned ones.

As I mentioned, this was considered the "base" (simple) approach. I had it implemented in previous versions but removed it later. I can put it back in.


I explained why --isolate is slow, but I'll try to rephrase it.

MDX-NET is originally distributed as an ONNX model. The version I use is the same one that is used by python-audio-separator. It's the exact same .onnx file published by the original authors.

The slowness is not because of the MDX-NET source separation model itself, but because the input to the model requires computing FFTs (STFT and ISTFT) over large windows with many overlaps, which, together with the extra pre- and post-processing needed, is currently more expensive than running the inference itself.

Also, MDX-NET models can run using the dml and cuda providers, but last time I checked, that was not stable. The instability is an issue with onnxruntime-node, not with my code.

I'm not sure what the crashes you are describing with long inputs are. If you're using the cpu provider, maybe they are related to Node.js memory limits. I can't know without more information.

Anyway, I gave links to two implementations that use more optimized FFT operations (I believe based on PyTorch's FFT, not ONNX). With the faster models (like UVR_MDXNET_1_9703, which I use by default), they can process several minutes of audio in a few seconds on a good GPU.

Another reason MDX-NET is slow is that it was originally designed for extracting vocals (not specifically speech) from music tracks, where quality is very important. Here, we don't really need high-quality separation; we need just enough that the speech audio features are similar enough to the reference audio features. It's also possible the high overlap isn't needed. For the prototype implementation, I decided to keep the same overlap to ensure it produces outputs similar to the reference.

an-lee (Author) commented Jul 23, 2024

Thanks for the explanations! But I'm a little confused now.

Here are my detailed procedures:

  1. Use whisper.cpp (outside of Echogarden) to transcribe the audio. This gives me the word timestamps.
  2. Build a sentenceTimeline with the word timestamps based on the punctuation.
  3. Call the alignSegments API with the sentenceTimeline. Now I get a wordTimeline with phone-level timestamps.
  4. Insert the wordTimeline into the sentenceTimeline.timeline to build a complete sentence timeline, which is what I need.
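
A rough sketch of steps 2-4 (simplified; recognizedWords, rawAudio and alignmentOptions are placeholders, and the sentence splitting here is just naive punctuation matching):

import * as Echogarden from 'echogarden'

// Step 2: group recognized words into sentences at sentence-final punctuation.
// (Trailing words without closing punctuation are ignored in this sketch.)
const sentenceTimeline = []
let currentWords = []

for (const word of recognizedWords) {
  currentWords.push(word)

  if (/[.!?]$/.test(word.text)) {
    sentenceTimeline.push({
      type: 'sentence',
      text: currentWords.map(w => w.text).join(' '),
      startTime: currentWords[0].startTime,
      endTime: currentWords[currentWords.length - 1].endTime,
    })

    currentWords = []
  }
}

// Step 3: get a word timeline (with phone-level entries) for each sentence range.
const wordTimeline = await Echogarden.alignSegments(rawAudio, sentenceTimeline, alignmentOptions)

// Step 4: attach each aligned word to the sentence whose range contains it.
for (const sentence of sentenceTimeline) {
  sentence.timeline = wordTimeline.filter(word =>
    word.startTime >= sentence.startTime && word.endTime <= sentence.endTime)
}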

So how can I improve these procedures? Any suggestions? Thank you.

rotemdan (Member) commented Jul 24, 2024

  1. I don't see why you need to use whisper.cpp outside of Echogarden (a minimal example follows this list). Echogarden can call any whisper.cpp main binary you give it (via whisperCpp.executablePath) and supports a lot of the essential options (except maybe things like grammars, but those could be added in the future if there's demand). Also, the token-to-text conversion Echogarden does is more accurate than the one done internally in whisper.cpp, since it uses the official tiktoken library, while whisper.cpp uses a custom tokenizer that is not 100% compatible with the reference implementation.
  2. You shouldn't need to do that yourself. Echogarden uses a library (cldr-segmentation) for sentence segmentation that works with many different languages and correctly handles things like numbers (24.645) and language-dependent abbreviations (Mr.).
  3. As I mentioned, a more correct way to align segments would be to use the audio ranges natively output by the Whisper model (based on the timestamp tokens it directly emits). Applying alignSegments to sentences, in the case of whisper.cpp, means that the start and end times of each range are only "guesses" and are not guaranteed to be accurate. Anyway, Echogarden doesn't do this right now, but I can add it as an option.
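
A minimal example of point 1 (the executable path below is just a placeholder):

import * as Echogarden from 'echogarden'

// Let Echogarden drive an external whisper.cpp binary instead of calling it directly.
const { transcript, wordTimeline } = await Echogarden.recognize(audio, {
  engine: 'whisper.cpp',
  whisperCpp: {
    executablePath: '/path/to/whisper.cpp/main',
  },
})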

Basically what I said is that I intend to add options to handle your needs and use case, since they are generally useful. You seem to mostly just want to have phonemes in your recognition timeline. That's not a problem to add.

Right now, development is simply not as fast as in March-May of this year, when I made a huge number of improvements (at one point releasing a significant new version every few days) and spent hundreds of hours on development.

That's just the way it is sometimes. This project has never been funded in any way and has been developed on a zero budget. There are more and less active periods. The last few months have been less active, but I still try to handle issues on the tracker.

an-lee (Author) commented Jul 24, 2024

  1. Understood, I'll try using whisper.cpp via Echogarden.
  2. Indeed, there are a lot of edge cases to handle. I'll try cldr-segmentation instead.
  3. Understood.

I completely understand. Take your time please.

For now, I'll work on improving my procedures based on the current codebase.

Thank you so much.

rotemdan (Member) commented

You mentioned you're possibly using cloud services for transcription, like OpenAI or Azure.

OpenAI's Whisper service is $0.006 per minute, which comes to $0.36 per hour.
Azure is something like $0.18 per hour (batch transcription).

However, I was looking at LLM API pricing and noticed that Groq's Whisper Large V3 (requires registration) is only $0.03/hour.
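
For reference, a quick check of how those per-hour figures compare:

// Per-hour cost comparison based on the published prices above.
const openAIPerHour = 0.006 * 60 // $0.36
const groqPerHour = 0.03         // $0.03
console.log(openAIPerHour / groqPerHour) // 12, so OpenAI is about 12x more expensive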

Groq also publishes these limits for my (unpaid) account (though I'm not sure if they are actually free? If they are, that's a lot):

[image: screenshot of Groq's published account limits]

Groq has an OpenAI compatible transcription endpoint at:

https://api.groq.com/openai/v1/audio/transcriptions

(which is also known to be very fast; there's a highly recommended site that publishes a leaderboard for cloud speech recognition services, where Groq is rated very highly on price/performance)

I'm not sure if it provides all the features of OpenAI's servers (say, including the accurate word timestamps). I'll need to test that.

I'll add support for custom endpoints for the openai-cloud engine, so it can be used in place of OpenAI's much more expensive (12x) one.
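
Once that's added, usage might look something like this (purely hypothetical; the baseURL option shown here doesn't exist yet, and the other option names are only assumed):

// Hypothetical sketch: point the openai-cloud engine at Groq's OpenAI-compatible endpoint.
const result = await Echogarden.recognize(audio, {
  engine: 'openai-cloud',
  openAICloud: {
    apiKey: process.env.GROQ_API_KEY, // placeholder
    baseURL: 'https://api.groq.com/openai/v1/', // hypothetical custom endpoint option
  },
})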

So as you see, adding optional phoneme timestamps would also need to cover the cloud engines, to get more consistent coverage.

I would love to just be able to implement all of these features immediately, but of course, sometimes even small changes take time and effort. Right now, these conversations are mostly helping me discover new issues and ideas and prioritize. (I do have partially committed work-in-progress code, especially on adding machine translation integration, so that takes the highest priority right now.)
