Replies: 1 comment
Deciding which implementation is better based on a single example is hard.
AFAIK, the sampling strategy in
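For context, the reference openai-whisper decoder layers a temperature fallback on top of beam search, and ports don't all reproduce it identically. A minimal sketch of those knobs (the values shown are the openai-whisper defaults; the file name is a placeholder):

```python
import whisper

model = whisper.load_model("medium.en")

# The reference decoder tries beam search at temperature 0 first, then
# falls back to sampling at higher temperatures whenever a segment fails
# its quality heuristics (repetition / low average log-probability).
result = model.transcribe(
    "episode.wav",                                # placeholder path
    beam_size=5,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),   # fallback schedule (default)
    compression_ratio_threshold=2.4,              # default: reject repetitive output
    logprob_threshold=-1.0,                       # default: reject low-confidence output
    no_speech_threshold=0.6,                      # default: treat segment as silence
    condition_on_previous_text=True,              # default: prior text used as prompt
)
print(result["text"])
```

A segment whose output looks repetitive or low-confidence gets re-decoded at the next temperature in the schedule, so two engines with different heuristics can diverge on the same model.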
Hi,
I've just started playing with this tech, and my audio samples are episodes of "Yes, Minister", a BBC sitcom from the 1980s that I rather love. The audio is quite clean: just about everything is indoors, with no background noise and nobody speaking over anyone else.
I have a 12-thread Ryzen 2600 CPU and no GPU. I've used the medium.en model with beam=5, and I haven't fiddled with any other options for the different implementations I've tried.
I tested whisper, whisper-cpp and whisper-faster. Whisper-faster ran at roughly realtime; whisper-cpp took about 2x realtime. I neglected to time whisper.
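To keep the comparison apples-to-apples, this is roughly how I invoked the implementations with matched settings (a sketch assuming the openai-whisper and faster-whisper Python packages; file and model paths are placeholders):

```python
import whisper
from faster_whisper import WhisperModel

AUDIO = "episode.wav"  # placeholder path

# openai-whisper (the reference PyTorch implementation)
ref = whisper.load_model("medium.en")
print(ref.transcribe(AUDIO, beam_size=5)["text"])

# faster-whisper (CTranslate2 backend); int8 is a common CPU setting
fw = WhisperModel("medium.en", device="cpu", compute_type="int8")
segments, info = fw.transcribe(AUDIO, beam_size=5)
print("".join(segment.text for segment in segments))

# whisper.cpp is a separate C++ binary; a roughly equivalent invocation is:
#   ./main -m models/ggml-medium.en.bin -f episode.wav --beam-size 5
```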
Whisper and whisper-faster were fair enough: some simplified and Americanized language, and both had problems catching things said in the background.
Whisper-cpp produced astonishingly good subs (for my sample): all the words, no Americanizations, and text for the (significant) sound from TVs, plus "[phone rings]" and "[music]".
Having read the paper and looked around, I still feel quite confused about how the same model could produce such different results. Is it all down to the inference engine, or are there runtime parameters I could experiment with?
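If it helps, these are the kinds of runtime parameters I mean, using the faster-whisper Python API (the initial prompt string is purely illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("medium.en", device="cpu", compute_type="int8")

segments, _ = model.transcribe(
    "episode.wav",                     # placeholder path
    beam_size=5,
    best_of=5,                         # candidates kept when sampling (temperature > 0)
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],  # fallback schedule
    condition_on_previous_text=True,   # feed decoded text back in as a prompt
    initial_prompt="Yes, Minister.",   # illustrative; a prompt can nudge spelling/style
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```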