feat: add Whispering support #454

charles-zablit · 2022-10-14T14:08:27Z

This PR adds support for Whispering, a streaming transcription server based on OpenAI's Whisper.

Whispering's advantage over VOSK is that it supports detection and transcription of multiple languages.

The Whispering Transcription service uses WebSockets to communicate with the Whispering server.

This is still a WIP: we need to fix a sample rate incompatibility issue between Whispering and Jigasi.
Right now, we have to set `EXPECTED_AUDIO_LENGTH` to 25600.
We also have to change https://github.com/shirayu/whispering/blob/256bf38b4d3d751e1eac8116f0f7da07e1b9652f/whispering/serve.py#L69
to `audio = np.frombuffer(message, dtype=np.int64)`.
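
For context on the sample rate issue: Whisper models take 16 kHz audio, so audio arriving at any other rate has to be resampled somewhere between Jigasi and the model. The following is a minimal sketch of linear-interpolation resampling using only the Python standard library. The function name `resample_linear` and the 48 kHz input rate in the usage line are assumptions for illustration; the thread does not state which rate Jigasi sends, and this is not the code used in this PR or in Whispering.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample `samples` from src_rate to dst_rate by linear interpolation.

    Hypothetical helper: no low-pass filtering is applied, so downsampling
    can alias. It only illustrates the index arithmetic involved.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # input sample at or before pos
        hi = min(lo + 1, last)     # next input sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Assumed rates for illustration: 48 kHz in, 16 kHz out keeps every third sample.
downsampled = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 48000, 16000)
```

Because this skips the anti-aliasing filter, a real deployment would more likely rely on a dedicated resampling library; the sketch is only meant to show what "fixing the sample rate" involves.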

nikvaessen · 2022-10-14T16:03:40Z

Some questions:

Have you tested if real-time transcription is feasible?
What model (tiny/base/etc) are you planning on running?
On what GPU are you planning to run this?
Any thoughts on serving transcriptions from one machine to multiple meetings?

charles-zablit · 2022-10-17T08:25:00Z

> Have you tested if real-time transcription is feasible?

It works just as fast as VOSK; however, it only starts transcribing after the sentence ends. It does not produce partial results, which might make it look slow.

> What model (tiny/base/etc) are you planning on running?

So far we have tested both medium and large, with very good performance.

> On what GPU are you planning to run this?

We have run our tests on a t1-45 OVH VPS, which has an NVIDIA Tesla V100.

> Any thoughts on serving transcriptions from one machine to multiple meetings?

We have not tested that yet, but it seems that Whispering supports multiple connections.
GPU usage is around 30% on our OVH instance for 1 connection, so multiple connections are doable.
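
As a rough back-of-the-envelope check of the figure above, here is a small sketch that estimates how many connections fit on one GPU. It assumes GPU usage scales linearly with the number of connections, which the thread has not measured; the function name and the `budget_percent` parameter are hypothetical.

```python
def estimate_max_connections(usage_percent: int, budget_percent: int = 100) -> int:
    """Naive estimate of concurrent connections, assuming linear scaling.

    usage_percent:  GPU usage observed for a single connection (30 in this thread).
    budget_percent: share of the GPU we are willing to use (assumed knob).
    """
    if usage_percent <= 0:
        raise ValueError("usage_percent must be positive")
    if not 0 < budget_percent <= 100:
        raise ValueError("budget_percent must be in (0, 100]")
    return budget_percent // usage_percent


full_gpu = estimate_max_connections(30)           # whole GPU available
with_headroom = estimate_max_connections(30, 80)  # keep 20% in reserve
```

Under that assumption, the 30% figure suggests about three simultaneous connections on the full GPU, or two if some headroom is kept. Real behaviour may differ and would need to be measured.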

If you want, we plan on presenting our findings at today's Jitsi community call.

codecov · 2022-10-17T09:17:45Z

Codecov Report

Merging #454 (e645d49) into master (dda0721) will decrease coverage by 0.76%.
The diff coverage is 0.00%.

❗ Current head e645d49 differs from pull request most recent head d8f88ae. Consider uploading reports for the commit d8f88ae to get more accurate results

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #454      +/-   ##
============================================
- Coverage     23.15%   22.39%   -0.77%     
  Complexity      304      304              
============================================
  Files            69       70       +1     
  Lines          5812     6006     +194     
  Branches        790      804      +14     
============================================
- Hits           1346     1345       -1     
- Misses         4235     4430     +195     
  Partials        231      231

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...rc/main/java/org/jitsi/jigasi/AbstractGateway.java | `68.60% <0.00%> (-11.13%)` | ⬇️ |
| .../java/org/jitsi/jigasi/AbstractGatewaySession.java | `63.49% <0.00%> (-4.31%)` | ⬇️ |
| src/main/java/org/jitsi/jigasi/JvbConference.java | `44.28% <0.00%> (-1.39%)` | ⬇️ |
| src/main/java/org/jitsi/jigasi/Main.java | `22.09% <0.00%> (-1.66%)` | ⬇️ |
| ...c/main/java/org/jitsi/jigasi/rest/HandlerImpl.java | `0.00% <0.00%> (ø)` | |
| ...jigasi/transcription/VoskTranscriptionService.java | `0.00% <0.00%> (ø)` | |
| .../transcription/WhisperingTranscriptionService.java | `0.00% <0.00%> (ø)` | |
| ...in/java/org/jitsi/jigasi/sounds/PlaybackQueue.java | `54.38% <0.00%> (-1.76%)` | ⬇️ |
| .../jitsi/jigasi/sounds/SoundNotificationManager.java | `29.62% <0.00%> (+0.41%)` | ⬆️ |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4964b52...d8f88ae. Read the comment docs.

nikvaessen · 2022-10-17T11:33:23Z

src/main/java/org/jitsi/jigasi/transcription/WhisperingTranscriptionService.java

+            ctx.put("no_speech_threshold", 0.6);
+            ctx.put("buffer_threshold", 0.5);
+            ctx.put("vad_threshold", 0.5);
+            ctx.put("data_type", "float32");


If using a GPU, I expect it should be a bit faster with float16. But most of the time is spent waiting for audio, I guess.

charles-zablit · 2022-10-17

My bad, this was supposed to be int64 since, from my understanding, this is the audio format Jigasi sends.

I have implemented a conversion to float32 in shirayu/whispering#36, but I will suggest float16 for better performance.
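
To make the data type discussion concrete, here is a minimal sketch of turning raw integer PCM bytes into float32 samples normalised to [-1.0, 1.0), using only the Python standard library. It is not the conversion implemented in shirayu/whispering#36. The function name, the little-endian byte order, and the default 16-bit sample width are assumptions for illustration; the thread itself mentions int64, which can be tried by passing `fmt="q"`.

```python
import struct


def pcm_to_float32_bytes(chunk: bytes, fmt: str = "h") -> bytes:
    """Convert little-endian signed integer PCM to packed float32 samples.

    fmt is a struct format character for one sample: "h" is 16-bit,
    "i" is 32-bit, "q" is 64-bit. Which width Jigasi really sends is
    not confirmed here, so the default is an assumption.
    """
    size = struct.calcsize("<" + fmt)
    if len(chunk) % size:
        raise ValueError("chunk length is not a multiple of the sample size")

    count = len(chunk) // size
    ints = struct.unpack(f"<{count}{fmt}", chunk)

    # Divide by 2**(bits - 1) so the most negative integer maps to -1.0.
    scale = float(1 << (8 * size - 1))
    return struct.pack(f"<{count}f", *(value / scale for value in ints))


# Three 16-bit samples: silence, half scale, and the most negative value.
raw = struct.pack("<3h", 0, 16384, -32768)
as_float32 = pcm_to_float32_bytes(raw)
```

Interpreting the same bytes with a different width changes both the number of samples and their values, which is why the client and server have to agree on the `data_type` sent in the context.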

davidak · 2023-03-22T02:47:07Z

@charles-zablit do you have a plan to finish this?

It would be a great feature, as I think Whisper is currently the best open-source STT. I would like to use it for meeting notes.

cryolite-ai · 2023-10-23T20:50:13Z

Hi @charles-zablit @nikvaessen, just wondering what happened to this particular Whisper-related Jigasi integration (which is about a year old)?

Rummaging through the current codebase, I see a file called https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java
which appears to be connected to PR #491.

Although it's not mentioned in the README (which refers to Google Cloud, Vosk, and LibreTranslate), there is now some recent code to link transcription to some sort of Whisper system. In contrast to what Charles was doing, though, the current code says 'a custom Whisper server - without any details', and there doesn't seem to be any documentation about what to set up or how. Charles, over a year ago, was just about ready with something that would use Whispering (https://github.com/shirayu/whispering/), which is MIT licensed. Unfortunately, the PR now has conflicts, and the Whispering project has been archived by its original author given the availability of newer Whisper systems, e.g. whisper.cpp, which works with CPU inference as well as GPU.

Is there any chance we could still have the Whispering PR integrated, since it uses Whisper from an open service as opposed to whatever is now in the codebase? If we had an example, it might be possible to adapt it to suit one of the newer Whisper implementations available these days. I've also seen some scripts which, if given multiple channels, will do some rough diarising so that the transcript incorporates multiple named speakers.
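
As an illustration of the channel-based diarising idea, here is a minimal sketch that merges per-channel transcript segments into one speaker-labelled transcript ordered by start time. It assumes each speaker is recorded on a separate channel and that each channel's segments are already sorted by start time. The `Segment` type and `merge_channels` function are hypothetical and not part of Jigasi or any Whisper implementation.

```python
import heapq
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: float  # start time in seconds
    text: str


def merge_channels(channels: Dict[str, List[Segment]]) -> List[str]:
    """Interleave per-speaker segments by start time.

    Each channel's list must already be sorted by start time, which is
    what heapq.merge requires of its inputs.
    """
    tagged = (
        [(seg.start, speaker, seg.text) for seg in segments]
        for speaker, segments in channels.items()
    )
    # Tuples compare by start time first, then by speaker name on ties.
    merged = heapq.merge(*tagged)
    return [f"{speaker}: {text}" for _, speaker, text in merged]


transcript = merge_channels({
    "alice": [Segment(0.0, "hello"), Segment(4.0, "bye")],
    "bob": [Segment(1.5, "hi")],
})
```

This only labels speech by the channel it arrived on; separating several speakers who share one channel would need an actual diarisation model.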

Many thanks for your work on all of this.

Best, M.

damencho · 2023-10-23T21:16:20Z

> in the current code it says 'a custom Whisper server - without any details' and there doesn't seem to be any documentation about what/how to set it up...

Where do you see this?

cryolite-ai · 2023-10-24T06:39:38Z

Link to source file was in my last post - here it is again:

> Rummaging through the current codebase I see in a file called: https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java
> which appears to be connected to PR #491

See line 27...

rpurdel · 2023-10-25T13:48:26Z

Hi,

We are still at a very early stage with our own Whisper live transcription implementation. We plan to make it open source in the not-so-distant future.

Cheers,
Razvan

rpurdel · 2024-02-08T11:27:35Z

@charles-zablit @nikvaessen @damencho

The Whisper live transcription server is now open source under the jitsi/skynet project. It should work out of the box with Jigasi.

charles-zablit added 3 commits October 14, 2022 15:49

feat: add Whispering support

298c1e5

chore: edit README

d537a6c

chore: fix checkstyle violations

7e1a4ee

charles-zablit mentioned this pull request Oct 17, 2022

feat: add datatype in context shirayu/whispering#36

Merged

nikvaessen reviewed Oct 17, 2022

chore: adapt new Whispering context

d8f88ae

charles-zablit force-pushed the feat-implement-whispering branch from e645d49 to d8f88ae Compare October 17, 2022 11:49

charles-zablit mentioned this pull request Oct 20, 2022

feat: add a resampling feature shirayu/whispering#41

Closed
