
Accelerate vDSP Mel Spectrogram support, Sample Accrual, macOS support, Add Tokenizer - WIP Branch #2

Open · wants to merge 47 commits into master

Conversation


@vade commented Jan 2, 2023

Hello

This is a WIP port that attempts to replace the Rust Mel spectrogram implementation with native vDSP / Accelerate.

I've opened a PR mostly as a WIP to have a place to discuss the work done!

Status: Incomplete, but close. Need some community help. We solved the repeating token problem, but aren't getting sensible output because our Mel spectrogram isn't matching Torch's exactly.

Work to date:

  • I've added a macOS SwiftUI implementation. I've updated the main method of the Whisper implementation to take a URL to an asset. The new decode method accrues a segment's worth of samples for Whisper transcription, runs it, then continues to accrue samples.

  • I've created a Log Mel spectrogram implementation with vDSP, and numerically checked it against Whisper's audio loading and normalization code (a minimal vDSP sketch follows this list). This code is close, but different enough to be causing incorrect output in Whisper. I have verified that we get correct output if we import a correct Log Mel as generated natively by Python.

  • I've updated the CoreML export script to output flexible shapes on the decoder's token input. It is my understanding that we need to pass an 'accrued' number of tokens: i.e. we start with the SOT token, predict on our segment of audio, get a new token, append it to a running token list, and run the decoder again with a tensor of all tokens so far.

  • I've created a tokenizer based on the GPT2Tokenizer implementation from Hugging Face's swift-transformers repo, and generously borrowed some code from that repo to help. I've implemented a very simple greedy top-1 token strategy (the decode loop is sketched after the Overview list below).
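
To make the Mel discussion concrete, here is a minimal sketch of the vDSP stage referenced above, covering only the filterbank projection, log, clamp, and rescale steps (the STFT itself is elided). This is not this branch's exact code: the function and parameter names are illustrative, and it assumes Whisper's 80-bin filterbank has already been loaded and transposed to freqBins x melBins.

```swift
import Accelerate

/// Illustrative sketch: turn STFT power-spectrum frames into Whisper-style
/// log-Mel features with vDSP / vForce. `powerFrames` is frameCount x freqBins
/// (row-major) of |STFT|^2 values; `melFilterbank` is freqBins x melBins
/// (row-major), e.g. Whisper's mel_filters matrix transposed to fit this multiply.
func logMelFrames(powerFrames: [Float], frameCount: Int,
                  melFilterbank: [Float], freqBins: Int, melBins: Int) -> [Float] {
    // Mel projection: (frameCount x freqBins) * (freqBins x melBins).
    var mel = [Float](repeating: 0, count: frameCount * melBins)
    vDSP_mmul(powerFrames, 1, melFilterbank, 1, &mel, 1,
              vDSP_Length(frameCount), vDSP_Length(melBins), vDSP_Length(freqBins))

    // log10(max(mel, 1e-10)), matching Whisper's floor before the log.
    let floored = vDSP.threshold(mel, to: 1e-10, with: .clampToThreshold)
    var logMel = vForce.log10(floored)

    // Clamp to (max - 8), then rescale into roughly [-1, 1] via (x + 4) / 4.
    let maxValue = vDSP.maximum(logMel)
    logMel = vDSP.clip(logMel, to: (maxValue - 8)...maxValue)
    return vDSP.divide(vDSP.add(4, logMel), 4)
}
```

Since Whisper applies the 1e-10 floor, the (max − 8) clamp, and the (x + 4) / 4 rescale in exactly that order, it's probably worth checking the STFT windowing and reflection padding first when chasing the remaining numerical differences against Torch.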

Overview:

Right now, this repo:

  • Creates an AVAssetReader with an audio mix output that forces single-channel, 16 kHz SInt16 output (reader settings are sketched after this list).
  • Decodes sample buffers and accrues the decoded samples until we hit the number of samples Whisper expects in a segment.
  • Creates a Log Mel spectrogram from the audio (or attempts to do so correctly).
  • Encodes it to audio features using the Whisper encoder model.
  • Primes with an SOT token, and sends that single token plus the audio features to the Whisper decoder.
  • This produces our first predicted logits, which we use to find the top-1 probable token, and save it. We keep iterating this way, sending the growing array of Int tokens to the decoder along with the same audio segment, until we hit an EOT token (see the decode-loop sketch after this list).
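
For reference, a sketch of the kind of reader configuration described in the first bullet above: an AVAssetReaderAudioMixOutput forcing mono, 16 kHz, 16-bit signed integer linear PCM. The function name and the absence of error handling are placeholders rather than this branch's code.

```swift
import AVFoundation

/// Sketch of the audio read path: force mono 16 kHz SInt16 PCM from any asset.
func makeAudioReader(for url: URL) throws -> (AVAssetReader, AVAssetReaderAudioMixOutput) {
    let asset = AVURLAsset(url: url)
    let reader = try AVAssetReader(asset: asset)
    let audioTracks = asset.tracks(withMediaType: .audio)
    let settings: [String: Any] = [
        AVFormatIDKey: kAudioFormatLinearPCM,
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 16,
        AVLinearPCMIsFloatKey: false,
        AVLinearPCMIsBigEndianKey: false,
        AVLinearPCMIsNonInterleaved: false
    ]
    let output = AVAssetReaderAudioMixOutput(audioTracks: audioTracks, audioSettings: settings)
    reader.add(output)
    return (reader, output)
}
```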
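And a rough sketch of the SOT-to-EOT greedy loop from the last two bullets, assuming a CoreML decoder with a flexible token shape. The feature names ("token_data", "audio_data", "token_logits"), the token IDs (50257 for SOT and 50256 for EOT, which hold for the English-only GPT-2 vocabulary), and the safety cap are assumptions, not necessarily what this branch uses.

```swift
import CoreML

/// Sketch of greedy top-1 decoding: start from SOT, re-run the decoder with the
/// whole accrued token sequence each step, stop at EOT or a safety cap.
func greedyDecode(decoder: MLModel,
                  audioFeatures: MLMultiArray,
                  sotToken: Int = 50257,
                  eotToken: Int = 50256,
                  maxTokens: Int = 224) throws -> [Int] {
    var tokens = [sotToken]
    while tokens.count < maxTokens {
        // Pack the accrued tokens into a [1, n] multi-array (flexible shape).
        let tokenArray = try MLMultiArray(shape: [1, NSNumber(value: tokens.count)], dataType: .int32)
        for (i, t) in tokens.enumerated() { tokenArray[i] = NSNumber(value: t) }

        let input = try MLDictionaryFeatureProvider(dictionary: [
            "token_data": MLFeatureValue(multiArray: tokenArray),
            "audio_data": MLFeatureValue(multiArray: audioFeatures)
        ])
        let output = try decoder.prediction(from: input)
        guard let logits = output.featureValue(for: "token_logits")?.multiArrayValue else { break }

        // Greedy top-1 over the last position's vocabulary logits.
        let vocabSize = logits.shape.last!.intValue
        let lastOffset = logits.count - vocabSize
        var bestToken = 0
        var bestScore = -Float.greatestFiniteMagnitude
        for v in 0..<vocabSize {
            let score = logits[lastOffset + v].floatValue
            if score > bestScore { bestScore = score; bestToken = v }
        }
        if bestToken == eotToken { break }
        tokens.append(bestToken)
    }
    return tokens
}
```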

Work left to do:

  • Triple-check the Log Mel code. You can use this iPython notebook to generate test data from your own audio the same way Whisper does, to verify things look correct numerically (a small comparison sketch follows this list). I believe this is correct, and pretty fast. I based it on Apple's Mel code examples. It could use some minor clean-up.

  • Implement a better tokenizer. The Whisper tokenizer leverages some 'special tokens', and my presumption is they are ignored in the BPE decode pass, sort of like 'additional logic', but I'm not entirely sure what needs to happen to properly implement them in Swift (see the special-token sketch after this list).

  • Fix token repetition from the decoder. I think the issue has to do with not having the equivalent of the "additional_special_tokens" handling in the custom tokenizer. It seems like we need to init the model with multiple tokens? The special-token sketch below touches on this.

  • Timestamps and additional quality-of-life work.
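
One way to do the numerical triple-check mentioned above is to dump the notebook's log-Mel output as raw little-endian Float32 and compare it element-wise against the vDSP result in Swift. The file format, helper name, and tolerance choice are assumptions, not part of this branch.

```swift
import Foundation

/// Sketch: compare a vDSP-computed log-Mel against a reference dumped from the
/// notebook as raw little-endian Float32, returning the worst element-wise error.
func maxAbsoluteDifference(computed: [Float], referenceURL: URL) throws -> Float {
    let data = try Data(contentsOf: referenceURL)
    let reference = data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    precondition(reference.count == computed.count, "shape mismatch")
    return zip(computed, reference).map { abs($0 - $1) }.max() ?? 0
}
```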
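On the special-token and repetition items: I'm not certain this is the missing piece, but one common approach is to mask special tokens out of the logits before the greedy argmax and to strip them before the BPE decode pass. A rough sketch, assuming the English-only vocabulary where IDs at or above 50257 are special:

```swift
/// Assumed boundary for the English-only (GPT-2-based) vocabulary; the
/// multilingual model uses different IDs.
let specialTokenThreshold = 50257

/// Keep special tokens from winning the greedy argmax.
/// EOT (50256) sits below the threshold and stays eligible.
func suppressSpecialTokens(logits: inout [Float]) {
    for id in specialTokenThreshold..<logits.count {
        logits[id] = -.greatestFiniteMagnitude
    }
}

/// Drop SOT / language / task / timestamp tokens before BPE decoding.
func stripSpecialTokens(_ tokens: [Int]) -> [Int] {
    tokens.filter { $0 < specialTokenThreshold }
}
```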

Thanks for any insight! Happy new year!

vade added 28 commits January 2, 2023 12:42
…ls dont appear to want to run on iOS though, too large?
…matrix order bug thanks to actually commenting our code.
…token decoding loop. Not working yet, but we are super close.
…s than Whisper / FFMPEG sint16 + normalization. Trying to figure out numerical compatibilities here.
@vade changed the title from "Accelerate vDSP Mel Spectrogram support - WIP Branch" to "Accelerate vDSP Mel Spectrogram support, Sample Accrual, macOS support, Add Tokenizer - WIP Branch" on Jan 8, 2023
@sahilshah

Hey, were you able to get this across the finish line? If not, what did you do about getting a fast Mel spectrogram for the Whisper model calls?
