
Whisper speaker diarization returns NaN for timestamps #1077

Closed
1 of 5 tasks
patrick-ve opened this issue Dec 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@patrick-ve

System Info

Node.js v20.10.0
@huggingface/transformers v3.1.0
Chrome 131

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

I cloned the example repo and made some changes to the transformers dependencies, since I immediately got an error when installing them.

I replaced "@xenova/transformers": "github:xenova/transformers.js#v3" with "@huggingface/transformers": "^3.1.0", and updated the import in worker.js accordingly.

After loading the model and attempting to transcribe the example video, a table is logged in which every segment's start and end values are NaN.

[Screenshot (2024-12-07 20:06): logged segment table with NaN start/end values]

Reproduction

Clone my repository, install the dependencies, and run the dev server:

    git clone https://github.com/patrick-ve/speaker-diarization-example
    yarn install
    yarn dev
@patrick-ve patrick-ve added the bug Something isn't working label Dec 7, 2024
@patrick-ve
Author

I managed to figure out that the property post_process_speaker_diarization does not exist on the Processor type, which causes NaN to appear as the start/end values.

Follow-up question: where would post_process_speaker_diarization be imported from?

@xenova
Collaborator

xenova commented Dec 8, 2024

Thanks for debugging! I've identified the problem, which #1082 will fix.

I'll also open a follow-up PR to improve the unit tests for the model.

@patrick-ve
Author

Thanks for your quick reply. A quick workaround I discovered earlier is to downgrade to "@huggingface/transformers": "3.0.0".

I have another small question related to this model. How would one track the progress of the actual speaker diarization? The example code mentions a progress_callback property, but that only tracks progress of loading the models:

    this.segmentation_processor ??= AutoProcessor.from_pretrained(
      this.segmentation_model_id,
      {
        progress_callback,
      }
    );

I thought that adding a callback_function property to the transcribe call would be enough, but this doesn't seem to work with an AutoModelForAudioFrameClassification model. I'm looking for something similar to this other transcription example.

@xenova
Collaborator

xenova commented Dec 8, 2024

Closed in 14bf689 👍

To respond to your question on progress: the model processes the audio all at once, and the post-processing acts on the model output, so there's technically no way to do it within the modelling code. However, you should be able to achieve something similar as follows:

  1. Split the audio into 10-second chunks (potentially overlapping). This is actually what the model was designed for.
  2. Process the chunks separately. This is where you can track progress.
  3. Post-process all the chunks and merge.

Note that you would need to incorporate some form of speaker identification model into the pipeline to ensure that the same speaker label is assigned across chunks (since there won't be any consistency between chunks).
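The chunk-and-track approach above can be sketched in plain JavaScript. This is only an illustration, not transformers.js API: chunkAudio, processWithProgress, and their parameters are hypothetical helper names, and processChunk stands in for whatever model call you run per chunk.

```javascript
// Split an audio buffer into fixed-size, optionally overlapping windows
// (step 1), hand each window to a processing function while reporting
// progress (step 2); merging the results is left to downstream code (step 3).

function* chunkAudio(samples, sampleRate, chunkSeconds = 10, overlapSeconds = 1) {
  const chunkLen = chunkSeconds * sampleRate;
  const step = (chunkSeconds - overlapSeconds) * sampleRate;
  for (let start = 0; start < samples.length; start += step) {
    // subarray() creates a view, so no audio data is copied.
    yield samples.subarray(start, Math.min(start + chunkLen, samples.length));
    if (start + chunkLen >= samples.length) break;
  }
}

async function processWithProgress(samples, sampleRate, processChunk, onProgress) {
  const chunks = [...chunkAudio(samples, sampleRate)];
  const results = [];
  for (let i = 0; i < chunks.length; i++) {
    results.push(await processChunk(chunks[i])); // e.g. run the segmentation model
    onProgress((i + 1) / chunks.length);         // fraction complete in [0, 1]
  }
  return results; // post-process and merge these afterwards
}
```

The overlap between consecutive windows gives the merging step some shared context, which helps when stitching segment boundaries back together.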

@xenova xenova closed this as completed Dec 8, 2024
@patrick-ve
Author

Thanks for the insight, it is much appreciated! I have decided to reduce complexity by not implementing speaker identification.

For now, I process the first 10 seconds of the audio and record the elapsed time, then multiply that by the number of chunks to get a very rough estimated time to completion. I then show a loading bar that progresses over this estimated duration, which should be enough to provide a decent user experience.
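The rough-ETA idea described above can be sketched as follows (a minimal illustration with hypothetical helper names, not code from the linked repository):

```javascript
// Estimate total processing time by extrapolating from the first chunk.
function estimateTotalMs(firstChunkElapsedMs, totalChunks) {
  return firstChunkElapsedMs * totalChunks;
}

// Drive a loading bar off the estimate: onTick receives a fraction in [0, 1].
// Returns a function that cancels the timer early (e.g. when the real work
// finishes sooner than estimated).
function startEtaBar(estimatedTotalMs, onTick, tickMs = 100) {
  const startedAt = Date.now();
  const timer = setInterval(() => {
    const fraction = Math.min((Date.now() - startedAt) / estimatedTotalMs, 1);
    onTick(fraction); // e.g. update a <progress> element's value
    if (fraction >= 1) clearInterval(timer);
  }, tickMs);
  return () => clearInterval(timer);
}
```

Since the estimate is rough, it is worth clamping the bar at something like 95% until the real work completes, so it never appears to finish before the results arrive.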
