
Whisper speaker diarization returns NaN for timestamps #1077

Closed
1 of 5 tasks
patrick-ve opened this issue Dec 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@patrick-ve

System Info

Node.js v20.10.0
@huggingface/transformers v3.1.0
Chrome 131

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

I cloned the example repo and made some changes to the transformers dependencies, since I immediately got an error when installing them.

I replaced "@xenova/transformers": "github:xenova/transformers.js#v3" with "@huggingface/transformers": "^3.1.0", and updated the import in worker.js accordingly.

After loading the model and attempting to transcribe the example video, a table is logged in which every segment's start and end values are NaN.

[Screenshot (2024-12-07 20:06): logged segment table with NaN start/end values]

Reproduction

Clone my repository, install the dependencies, and run the dev server:

    git clone https://github.com/patrick-ve/speaker-diarization-example
    yarn install
    yarn dev
@patrick-ve patrick-ve added the bug Something isn't working label Dec 7, 2024
@patrick-ve
Author

I managed to figure out that the property post_process_speaker_diarization does not exist on the Processor type, which causes NaN to appear as the start/end values.

Follow-up question: where would post_process_speaker_diarization be imported from?

@xenova
Collaborator

xenova commented Dec 8, 2024

Thanks for debugging! I've identified the problem, which #1082 will fix.

I'll also open a follow-up PR to improve the unit tests for the model.

@patrick-ve
Author

Thanks for your quick reply. A quick workaround I discovered earlier is to downgrade to "@huggingface/transformers": "3.0.0".

I have another small question related to this model. How would one track the progress of the actual speaker diarization? The example code mentions a progress_callback property, but that only tracks progress of loading the models:

    this.segmentation_processor ??= AutoProcessor.from_pretrained(
      this.segmentation_model_id,
      {
        progress_callback,
      }
    );

I thought that adding a callback_function property to the transcribe call would be enough, but this doesn't seem to work with an AutoModelForAudioFrameClassification model. I'm looking for something similar to this other transcription example.

@xenova
Collaborator

xenova commented Dec 8, 2024

Closed in 14bf689 👍

To respond to your question on progress: the model processes the audio all at once, and the post-processing acts on the model output, so there's technically no way to do it within the modelling code. However, you should be able to achieve something similar as follows:

  1. Split the audio into 10-second chunks (potentially overlapping). This is actually what the model was designed for.
  2. Process the chunks separately. This is where you can track progress.
  3. Post-process all the chunks and merge.

Note that you would need to incorporate some form of speaker identification model into the pipeline to ensure that the same speaker label is assigned across chunks (since there won't be any consistency between chunks).
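The chunk-and-track approach above can be sketched in plain JavaScript. This is only an illustration, not transformers.js API: chunkAudio, processWithProgress, and their parameters are hypothetical helper names, and processChunk stands in for whatever model call you run per chunk.

```javascript
// Split an audio buffer into fixed-size, optionally overlapping windows
// (step 1), hand each window to a processing function while reporting
// progress (step 2); merging the results is left to downstream code (step 3).

function* chunkAudio(samples, sampleRate, chunkSeconds = 10, overlapSeconds = 1) {
  const chunkLen = chunkSeconds * sampleRate;
  const step = (chunkSeconds - overlapSeconds) * sampleRate;
  for (let start = 0; start < samples.length; start += step) {
    // subarray() creates a view, so no audio data is copied.
    yield samples.subarray(start, Math.min(start + chunkLen, samples.length));
    if (start + chunkLen >= samples.length) break;
  }
}

async function processWithProgress(samples, sampleRate, processChunk, onProgress) {
  const chunks = [...chunkAudio(samples, sampleRate)];
  const results = [];
  for (let i = 0; i < chunks.length; i++) {
    results.push(await processChunk(chunks[i])); // e.g. run the segmentation model
    onProgress((i + 1) / chunks.length);         // fraction complete in [0, 1]
  }
  return results; // post-process and merge these afterwards
}
```

The overlap between consecutive windows gives the merging step some shared context, which helps when stitching segment boundaries back together.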

@xenova xenova closed this as completed Dec 8, 2024
@patrick-ve
Author

Thanks for the insight, it is much appreciated! I have decided to reduce complexity by not implementing speaker identification.

For now, I process the first 10 seconds of the audio and record the elapsed time, then multiply that by the number of chunks to get a very rough estimated time to completion. I then show a loading bar that progresses over this estimated duration, which should be enough to provide a decent user experience.
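The rough-ETA idea described above can be sketched as follows (a minimal illustration with hypothetical helper names, not code from the linked repository):

```javascript
// Estimate total processing time by extrapolating from the first chunk.
function estimateTotalMs(firstChunkElapsedMs, totalChunks) {
  return firstChunkElapsedMs * totalChunks;
}

// Drive a loading bar off the estimate: onTick receives a fraction in [0, 1].
// Returns a function that cancels the timer early (e.g. when the real work
// finishes sooner than estimated).
function startEtaBar(estimatedTotalMs, onTick, tickMs = 100) {
  const startedAt = Date.now();
  const timer = setInterval(() => {
    const fraction = Math.min((Date.now() - startedAt) / estimatedTotalMs, 1);
    onTick(fraction); // e.g. update a <progress> element's value
    if (fraction >= 1) clearInterval(timer);
  }, tickMs);
  return () => clearInterval(timer);
}
```

Since the estimate is rough, it is worth clamping the bar at something like 95% until the real work completes, so it never appears to finish before the results arrive.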
