Accelerate vDSP Mel Spectrogram support, Sample Accrual, macOS support, Add Tokenizer - WIP Branch #2
Open
vade wants to merge 47 commits into `tanmayb123:master` from `vade:feature/AccelerateMEL`
Conversation
…ls dont appear to want to run on iOS though, too large?
…er / clearer what needs to happen
…matrix order bug thanks to actually commenting our code.
…token decoding loop. Not working yet, but we are super close.
…s than Whisper / FFMPEG sint16 + normalization. Trying to figure out numerical compatibilities here.
vade changed the title from "Accelerate vDSP Mel Spectrogram support - WIP Branch" to "Accelerate vDSP Mel Spectrogram support, Sample Accrual, macOS support, Add Tokenizer - WIP Branch" on Jan 8, 2023
…er show it tokens it doesnt know about. the code is terrible. bear with me.
…e need to get our tokenizer strategies sorted.
…had the wrong tokens, and wrong decoder settings. Uggghhhhhhhhh newb shit.
…SPSplitComplex object and do maths frmo it.
…ListPointer. Clean up some CMSampleBuffer code.
…sing repeating / bad values in our MEL output.
Hey, were you able to get this across the finish line? If not, what did you do about getting a fast mel spectrogram for the Whisper model calls?
Hello,
This is a WIP port that attempts to remove the Rust Mel spectrogram implementation and use native vDSP / Accelerate instead.
I've opened this PR mostly as a WIP, to have a place to discuss the work done!
Status: Incomplete, but close. Needs some community help. We solved the repeating-token problem, but aren't getting sensible output because our Mel spectrogram isn't matching Torch's exactly.
Work to date:
I've added a macOS SwiftUI implementation, and updated the main method of the Whisper implementation to take a URL to an asset. The new decode method accrues a set of samples for one segment of Whisper transcription, runs it, then continues to accrue samples for the next segment.
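The sample-accrual idea can be sketched outside the Swift/CMSampleBuffer machinery. The following is a hedged illustration (not this repo's actual API): incoming audio chunks are buffered until a full Whisper window (30 s at 16 kHz, Whisper's defaults) is available, and the final partial window is zero-padded the way Whisper does.

```python
import numpy as np

SAMPLE_RATE = 16_000
SEGMENT_SAMPLES = 30 * SAMPLE_RATE  # Whisper operates on 30-second windows

def accrue_segments(chunks):
    """Yield fixed-size segments from an iterable of audio chunks.

    Illustrative only: the real code accrues samples from CMSampleBuffers.
    """
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        buffer = np.concatenate([buffer, np.asarray(chunk, dtype=np.float32)])
        # Emit every complete segment we have accrued so far
        while len(buffer) >= SEGMENT_SAMPLES:
            yield buffer[:SEGMENT_SAMPLES]
            buffer = buffer[SEGMENT_SAMPLES:]
    if len(buffer):
        # Whisper zero-pads the trailing partial window to a full 30 s
        yield np.pad(buffer, (0, SEGMENT_SAMPLES - len(buffer)))
```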
I've created a log Mel spectrogram implementation with vDSP, and numerically checked it against Whisper's audio loading and normalization code. This code is close, but different enough to cause incorrect output in Whisper. I have verified that we get correct output if we import a correct log Mel generated natively in Python.
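The reference the vDSP port has to match is Whisper's Python pipeline. As a comparison target, that recipe can be sketched in plain NumPy roughly as follows; the constants come from Whisper's audio code and the Slaney-style filterbank mirrors librosa's default. This is a sanity-check sketch of the reference math, not the repo's vDSP code, and frame counts at the boundaries may differ slightly from torch.stft.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16_000, 400, 160, 80  # Whisper's constants

def hz_to_mel(f):
    # Slaney scale: linear below 1 kHz, logarithmic above (librosa's default)
    f = np.asarray(f, dtype=np.float64)
    return np.where(f >= 1000.0,
                    15.0 + np.log(np.maximum(f, 1e-9) / 1000.0) / (np.log(6.4) / 27.0),
                    f * 3.0 / 200.0)

def mel_to_hz(m):
    m = np.asarray(m, dtype=np.float64)
    return np.where(m >= 15.0,
                    1000.0 * np.exp((np.log(6.4) / 27.0) * (m - 15.0)),
                    m * 200.0 / 3.0)

def mel_filterbank():
    freqs = np.linspace(0.0, SR / 2.0, N_FFT // 2 + 1)
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2.0), N_MELS + 2))
    fb = np.zeros((N_MELS, len(freqs)))
    for i in range(N_MELS):
        lo, ctr, hi = pts[i], pts[i + 1], pts[i + 2]
        # Triangular filters: rising and falling ramps, clipped at zero
        fb[i] = np.maximum(0.0, np.minimum((freqs - lo) / (ctr - lo),
                                           (hi - freqs) / (hi - ctr)))
        fb[i] *= 2.0 / (hi - lo)  # Slaney area normalization
    return fb

def log_mel_spectrogram(audio):
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann, as torch.hann_window
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    frames = [np.abs(np.fft.rfft(padded[s:s + N_FFT] * window)) ** 2
              for s in range(0, len(audio), HOP)]
    mel = mel_filterbank() @ np.stack(frames, axis=1)
    log_spec = np.log10(np.maximum(mel, 1e-10))            # clamp, then log10
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # 80 dB dynamic range
    return (log_spec + 4.0) / 4.0                          # Whisper's final scaling
```

The final `max(log_spec, max - 8) / 4` step is easy to miss when porting; without it the model sees inputs far outside its training distribution.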
I've updated the CoreML export script to output flexible shapes on the decoder's token input. It is my understanding that we need to pass an 'accrued' number of tokens: i.e. we start with the SOT token, predict on our segment of audio, get a new token, append it to a running token list, and run the decoder again with a tensor containing both tokens.
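The accrued-token loop described above can be sketched like this, with a stand-in `decoder` callable where the real CoreML model would go. The SOT/EOT roles and the 224-new-token cap follow Whisper's conventions; everything else is illustrative.

```python
import numpy as np

def greedy_decode(decoder, audio_features, sot_token, eot_token, max_new_tokens=224):
    """Autoregressive greedy (top-1) decoding with an accrued token list.

    `decoder(audio_features, tokens)` is a hypothetical stand-in that
    returns the logits for the last position, shape (vocab_size,).
    """
    tokens = [sot_token]                         # start from the SOT token
    for _ in range(max_new_tokens):
        logits = decoder(audio_features, tokens)  # full accrued list fed back in
        next_token = int(np.argmax(logits))       # greedy top-1 strategy
        tokens.append(next_token)
        if next_token == eot_token:
            break
    return tokens
```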
I've created a Tokenizer based on the GPT2Tokenizer implementation from Hugging Face's swift-transformers repo, and generously borrowed some code from that repo to help. I've implemented a very simple greedy top-1 token strategy.
Overview:
Right now, this repo:
Work left to do:
Triple-check the log Mel code. You can use this IPython notebook to generate test data from your own audio the same way Whisper does, to verify that things smell correct numerically. I believe this code is correct, and pretty fast; I based it off of Apple's Mel code, with some examples. It could use some minor cleanup.
Implement a better tokenizer. The Whisper tokenizer leverages some 'special tokens', and my presumption is that they are ignored in the BPE decode pass, sort of like 'additional logic', but I'm not entirely sure what needs to happen to properly implement them in Swift.
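For what it's worth, one common approach is simply to filter special-token ids out before the byte-level BPE decode. A hedged sketch with a toy vocabulary (the `bytes_to_unicode` table is the standard GPT-2 byte-to-unicode mapping; the vocab and special ids below are made up for illustration):

```python
def bytes_to_unicode():
    # GPT-2's reversible byte -> printable-unicode mapping
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:           # unprintable bytes get shifted above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def decode(token_ids, vocab, special_ids):
    # Special tokens (SOT, EOT, language tags, ...) never reach the BPE pass
    text = "".join(vocab[t] for t in token_ids if t not in special_ids)
    # Map each mapped character back to its raw byte, then decode as UTF-8
    return bytes(BYTE_DECODER[ch] for ch in text).decode("utf-8", errors="replace")
```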
Fix token repetition from the decoder. I think the issue has to do with not having the equivalent of the "additional_special_tokens" handling in the custom tokenizer. It seems like we need to init the model with multiple tokens?
Timestamps, and additional quality-of-life work.
Thanks for any insight! Happy new year!