Run encoder on Apple Neural Engine #548

wangchou · 2023-03-01T07:36:54Z

wangchou
Mar 1, 2023

For small whisper model, I observed 6x speed up on encoder when running it on Apple Neural Engine.

Encoder time per run
whisper.cpp: 1030ms (CPU, 4 threads)
CoreML model: 174ms (Apple Neural Engine)

I wonder is it a good idea to use CoreML model in whisper.cpp.
PS: The decoder part of whisper.cpp is much faster than CoreML because of kv_cache.

Tested on MacBook M1 Air 16GB

Python Conversion Script

import whisper
import torch
import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# model setting
modelSize="small"
model = whisper.load_model(modelSize).cpu()
n_state = 768 # tiny=384, base=512, small=768, medium=1024, large=1280

# trace model by torch.jit
encoder = model.encoder
encoder.eval()

mel_segment = torch.randn((1, 80, 3000))
traced_encoder = torch.jit.trace(encoder, mel_segment)

# convert to coreml model
encoder = ct.convert(
    traced_encoder,
    inputs=[ct.TensorType(shape=mel_segment.shape)],
    compute_units=ct.ComputeUnit.ALL,
)

encoder_fp16 = quantization_utils.quantize_weights(encoder, nbits=16)
encoder_fp16.save(f"encoder_{modelSize}_fp16.mlmodel")

# test accuracy
torch_output = traced_encoder.forward(mel_segment)
print("torch model output:", torch_output)
x_1 = mel_segment.cpu().detach().numpy()
coreml_output = torch.from_numpy(
  list(encoder_fp16.predict({'x_1': x_1}).values())[0]
)
print(f"coreml {modelSize} model output:", coreml_output)
diff = torch.abs(torch_output - coreml_output).detach()
print("diff avg,max:", torch.mean(diff), torch.max(diff))

# note
# convertion time on Macbook M1 Air 16GB
# tiny: 28s
# small: 5 mins
# medium: 40 mins (29GB)
# large: crashed, use 60+GB memory after 23mins

ggerganov · 2023-03-01T08:18:13Z

ggerganov
Mar 1, 2023
Maintainer

From a quick search it looks like CoreML can be used through Objective-C. If we can write a simple Obj-C wrapper function that takes a mel segment, runs CoreML-based whisper encoder and outputs the resulting encoder embeddings, I think I can easily plug it in whisper.cpp.

Given the performance numbers that you observe - this will be a game changer!

I will probably take a look into this at some point, but since you already have things going - would appreciate any help on this.

17 replies

rsomani95 Mar 6, 2023

@wangchou I've done some work porting Whisper to ANE using the apple codebase. I have the encoder working. The decoder is a bit tricky because of lack of ANE support for dynamic shapes.

I have a couple of open issues on the CoreML git that are directly relevant to porting the decoder to ANE:
apple/coremltools#1764
apple/coremltools#1763

I'm cleaning up the code a bit, and will share whatever I have working in a few hours

wangchou Mar 6, 2023
Author

@rsomani95 Good to hear your works. I'm still struggling on understanding ml-ane-transformers repos.😓 If we export the kv_cache in constant size from decoder like @ggerganov suggested, do you think we could fixed the input token shape to (1, 1)?

ggerganov Mar 6, 2023
Maintainer

@wangchou
If you can export a decoder model that works with shape (1, 1), I can easily modify the inference in whisper.cpp to use it and feed the prompt token-by-token. It will not be optimal in terms of speed, but at least we can get an estimate about the ANE decoder performance.

If this works, we can try to export decoders with enumerated input shapes of power of 2: for example (1, 1), (1, 2), (1, 4), (1, 8) ... (1, 256). From @rsomani95 links, enumerated input should work with ANE. We then have to feed the prompt in powers of 2.
Again, not optimal, but pretty close to optimal.

rsomani95 Mar 6, 2023

Yeah it's a bit tricky! Took me a few days to wrap my head around it.

Sorry, I'm not sure I fully follow what you mean re. the decoder. On a higher level, do you mean passing only one token to the decoder at a time?

To respond to your other comment re. the [decoder]:

That is possible. I noticed it in HuggingFace's Whisper model.
They returns the kv_cache in each run, and passes past_key_value to next run. ( https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel.forward.past_key_values )
I wondered how much kv_cache could speed up the decoder when ANE needs to talk to CPU every 9ms.
And that codebase is too complex for me, so I didn't look into it. 😓

This is also done in the OpenAI-Whisper repo. I haven't studied the HF repo, but I suspect the mechanism is similar. OpenAI uses PyTorch hooks to do this where they dynamically keep track of the kv_cache and modify the model at runtime. It makes my head hurt a bit tbh. As far as I'm aware, this behavior is not possible to replicate in CoreML.
Here's the relevant kv_caching code from OpenAI's repo:

Also see this thread from @hollance and @vade on the topic:
https://twitter.com/mhollemans/status/1620806746318196739?s=61&t=DuLRkH6QCYXgg4Umq-fdGQ

The only way I see the decoder working is if each block is exported separately, and its forward function is modified to accept the input values from the kv_cache. This would need to be managed externally and as you mentioned, the overhead to keep passing the tensors from CoreML <-> CPU might not be worth it.
That being said, I'd be very curious to know if our assumptions are right.

wangchou Mar 6, 2023
Author

@rsomani95 Sorry for my misleading text. Just like what your said in last paragraph. I means if we converted a coreml encoder model that supports exporting kv_cache and accept past_key_value as input. For that coreml model, the input tokens shape could always be (1, 1). So we could avoid flexible shape is not supported by ANE issue.

@ggerganov I will try to convert a model with fixed input and kv_cache support. But it may take several days, I'm still playing around Whisper's repo slowly. I have more interest on @rsomani95 's work. He ported model to ml-ane-transformer's architecture. It should speed up coreml model a lot.

rsomani95 · 2023-03-06T08:09:58Z

rsomani95
Mar 6, 2023

Check out the whisper_ane project.
I've released the small, tiny, and base variants for the encoder. There's 4 exports of each (stock and ANE variants, at fp16 and fp32 variants), available in the latest release: https://github.com/Synopsis/whisper_ane/releases/tag/v0.1.0-rc1

cc @ggerganov @wangchou

1 reply

rsomani95 Mar 6, 2023

Here's the benchmarks I ran on my system (16 inch M1 Max MBP 2021, 64GB RAM, MacOS 13.0).

Key observations from the XCode benchmarks:

All FP16 exports (base and optimised arch) run on ANE 100%
All FP32 exports (base and optimised arch) run on GPU 100%
In FP16, the optimised arch is significantly faster
In FP32, the base arch is significantly faster

Caveats:

It's unclear how exactly XCode runs their benchmark. We do not know how the performance scales at larger batch sizes.
Though the FP32 base model looks the fastest per XCode, we may see a different result when running the optimised ANE model on larger batch sizes as it is 100% ANE accelerated

RobertRiachi · 2023-03-08T02:54:32Z

RobertRiachi
Mar 8, 2023

Hey folks! Awesome work :) I was made aware of this thread by @ggerganov after a conversation we had on Twitter.

Long story short I've optimized both Whisper's encoder and decoder to run on Apple's Neural Engine a couple weeks back, and have hacked flexible sized inputs for the decoder (though not recommended lol). I've done this twice, once on-top of huggingface's implementation of Whisper, and I've published a version built on-top of OpenAI's implementation:
https://github.com/RobertRiachi/ANE-Optimized-Whisper-OpenAI

I can validate @rsomani95's benchmarks as I too get similar fp32 encoder prediction performance :)

Speeding up the current encoder

Quantizing to fp16 and using the standard LLM data format of (batch, seq, embed_dim) actually slows down prediction time, so with a few changes we can get even more performance out of @wangchou 's idea!

The current implementation uses the standard LLM data format of (batch, seq, embd_dim) but the neural engine's most conductive data format is 4D and channels first. We also want the last axis to be the sequence since the last axis of the ANE buffer isn't packed, and must be contiguous and aligned to 64 bytes. This only applies to the last axis, and since we're quantizing to fp16 the neural engine is actually padding it up to 64bytes which results in 32 times the memory cost for 16bit precision.

TLDR; By switching to (batch, embed_dim, 1, seq) we can further improve the speed of the encoder.

Decoder & Kvcaching

Decoding a (1,1) token with an optimized ANE decoder model ran prediction at best 16ms which is still slower than the 7s currently achieved on CPU.

I've spent a good amount of time attempting to figure out a solution to the kvcaching problem, the fundamental issue is that cormel models are unable to branch thus making this difficult. We could export two versions of the decoder, one that's not expecting a kvcache for the first token and another that can handle the kvcache case, but that's pretty gross.

Quantization

I actually haven't noticed any performance gains by quantizing to fp16 from fp32, the prediction speed is roughly equivalent. Using fp16 instead of fp32 actually slows down compilation time by roughly 2x in all my tests. I suspect this has to do with how the quantize_weights method in coremltools throws an error if you specify "mlprogram" as the model type in your ct.convert call, and if no convert_to argument is provided coremltools seems to default to creating "NeuralNetwork proto" instead of a "MILSpec.Program proto" (source).

36 replies

vade Jul 11, 2023

FWIW, I have some Apple Radar / Feedback assistant bugs for ANE Compiler service

FB11926774 - ANECompilerService runs every time XCode compiles my project, rather than once and caching the result.
FB12038163 - Flexible input shapes not Neural Engine Accelerated?

If anyone files any additional CoreML / CoreMLTools Bugs with Apple, ive heard rumors linking to other issues helps the team isolate.

ggerganov Jul 11, 2023
Maintainer

@wangchou

I tried to apply the new LayerNorm in #1097
I guess I'm missing something - the results with base.en are corrupted

wangchou Jul 12, 2023
Author

@ggerganov Sorry, maybe the issue is data-dependent. I only tested transcribed text is fixed on small, medium and large on one minute long audio. In openai/whisper's code, they convert data to float32 before LayerNorm and Softmax. If I force coreml use float32 on these ops, I do observe some output difference drop 10 times. But the prediction time for small model grows: 110ms -> 155ms (LayerNorm on GPU) -> 307ms (LayerNorm and Softmax on GPU).

example of forcing ops runs on fp32 by Graph Passes Option

pipeline = ct.PassPipeline()
pipeline.set_options("common::add_fp16_cast", {"skip_ops_by_type": "layer_norm,softmax"})
    
encoder = ct.convert(
    #...
    pass_pipeline=pipeline,
)

PS: This is just another hypothesis. I didn't test it yet. I will test encoder on various inputs this week.

wangchou Jul 14, 2023
Author

@ggerganov I put encoder conversion script at https://github.com/wangchou/whisper/tree/for_whisper_cpp . I rewrite the AudioEncoder. Encoder is divided into sub-models for speed up ANECompilerService. I use CVPixelBuffer to passing data between sub-models, so prediction time is the same. And the conversion time of the large model is reduced to 24s by skip_model_load option.

PS: I didn't find base.en issue with samples/gb0.wav when transcribing it. (new version of AudioEncoder)

ggerganov Jul 25, 2023
Maintainer

@wangchou

Thanks - will give this a try, though I'm bit overloaded + vacation soon.
Would be great if someone tries to plug it in whisper.cpp and measure the performance and conversion time compared to the existing Core ML encoder. If not, I will do it after I come back from the vacation sometime in August

vade · 2023-03-08T18:59:09Z

vade
Mar 8, 2023

We got some feedback from the author of the Apple Neural Engine optimizations for Transformers on our GitHub issue in CoreMLTools who offered some advice for using ct.EnumeratedShapes and more. Posting here for visibility:

apple/coremltools#1763 (comment)

0 replies

CTres · 2023-03-15T22:28:54Z

CTres
Mar 15, 2023

as a founder in a venture backed ML team building in this space - this is some highly encouraging work fellas! Email is open: [email protected]

0 replies

mblyk · 2023-03-23T18:32:20Z

mblyk
Mar 23, 2023

Did someone try to run the core ml model on iOS?
I encounter this error:

[espresso] [Espresso::handle_ex_plan] exception=ANECF error:
[coreml] Error plan build: -1.

It happens when ANECompilerService exceeds CPU usage during the first run.

CPU:              90 seconds CPU time over 91 seconds (99% CPU average), exceeding the limit of 50% CPU over 180 seconds
CPU limit:        90s
Limit duration:   180s
CPU used:         90s
CPU duration:     91s

It doesn't relate directly to the model's size; the same crash happens with the base and small models.
Can the core ml model utilize less CPU during the first launch?

1 reply

ephemer May 15, 2023

This is the main issue I am seeing as well, with models of any size. I get it more than 50% of the time when running the model on my mac. This looks like something we could file an Apple Feedback report about, but I'm not sure how right now

vade · 2023-03-23T18:35:58Z

vade
Mar 23, 2023

CoreML complication result can be triggered once and loaded manually via some additional CoreML runtime calls. But yes, this is annoying as hell and I think theres an Xcode bug which triggers multiple re-compliations even on successive runs with no changes.

AFAIK - Building and deploying for devices should NOT incur the ANE compilation step if you build your app correctly.

See https://developer.apple.com/documentation/coreml/mlmodel/3931182-compilemodel and

0 replies

jasontitus · 2023-04-08T15:40:26Z

jasontitus
Apr 8, 2023

What is the process folks are using to convert to coreml models? I tried both https://github.com/wangchou/callCoreMLFromCpp and https://gist.github.com/RobertRiachi/d75bf6946bb8f1cea391c3c03a4ba4db and they both throw an assert -

assert x.shape[1:] == self.positional_embedding.shape[::-1], "incorrect audio shape"

And the output files don't recognize correctly (only gets a single word on the JFK example).

Just looking to produce coreml versions of the medium and large models since I have a Mac Studio with 128GB of RAM to process them. Happy to upload them once I get something working.

5 replies

RobertRiachi Apr 12, 2023

Hey it seems to be working fine on a fresh install, can you confirm what versions of the required packages you're using? I'm able to get it to work with:

torch==2.0.0
whisper==1.1.10
coremltools==6.2
latest from pip install ane_transformers

jasontitus Apr 13, 2023

With those package versions, I get the following -
python3 Whisper_ANE_export.py
Torch version 2.0.0 has not been tested with coremltools. You may run into unexpected errors. Torch 1.13.1 is the most recent version that has been tested.
Traceback (most recent call last):
File "./Whisper_ANE_export.py", line 11, in
from whisper.model import Whisper, AudioEncoder, TextDecoder, ResidualAttentionBlock, MultiHeadAttention, ModelDimensions
ModuleNotFoundError: No module named 'whisper.model'; 'whisper' is not a package

RobertRiachi Apr 14, 2023

You're missing the whisper package from openai

pip install -U openai-whisper

github repo if you need more instructions: https://github.com/openai/whisper

Aeon Apr 17, 2023

@RobertRiachi FWIW, when I try to install the packages with these versions, pip install ane_transformers uninstalls torch 2.0.0 and re-installs torch 1.11.0, since it's pinned at https://github.com/apple/ml-ane-transformers/blob/main/requirements.txt

Are you using a different fork by chance?

I found the fork by KE7 that moves torch dependency to 2.0.0, but I'm not sure if there's a reason why it was pinned at 1.11.0, and if that reason is addressed in their fork.

Tried to install their fork, but I still get the following message when I run ./models/generate-coreml-model.sh medium command:

ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=1024, n_audio_head=16, n_audio_layer=24, n_vocab=51865, n_text_ctx=448, n_text_state=1024, n_text_head=16, n_text_layer=24)
/Users/agf/Library/Python/3.9/lib/python/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
....

RobertRiachi Apr 18, 2023

Hey @Aeon you can safely ignore the ane_transformers 1.11.0 requirement for this to work. The easiest thing to do would be to just pip install ane_transformers first, then pip install torch==2.0 and ignore the fork.

Not sure about the generate-coreml-model.sh but I just updated Jason on a bug and how to fix it for exporting model sizes that aren't small. Let me know if you run into other issues :)

jasontitus · 2023-04-14T04:48:20Z

jasontitus
Apr 14, 2023

Nope - I had it installed last time. Just was quickly trying to get the same moduleset and installed base on the list in the last message. With openai-whisper it got further, but changing: whisper = load_model("small").cpu() to whisper = load_model("medium").cpu() caused it to error out on the conversion - RuntimeError: Given groups=1, weight of size [1024, 1024, 1, 1], expected input[1, 768, 1, 1500] to have 1024 channels, but got 768 channels instead Are you actually able to convert the medium model? That is all I am shooting for.

…

On Thu, Apr 13, 2023 at 7:27 PM Robert Riachi ***@***.***> wrote: You're missing the whisper package from openai pip install -U openai-whisper github repo if you need more instructions: https://github.com/openai/whisper — Reply to this email directly, view it on GitHub <#548 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGUOXQJP2UTDMIW7BMH6CTXBCYXHANCNFSM6AAAAAAVLV4I2I> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.*** com>

1 reply

RobertRiachi Apr 18, 2023

Hey @jasontitus sorry for the late response been swamped recently, you actually found a bug. The audio_shape on line 255 is hardcoded for the small sized model by mistake. You can resolve this by changing that line to audio_shape = (1, 1024, 1, 1500) and it should convert the medium model successfully. I just tried it and confirmed I got the same error as you, but after that fix I was able to convert the medium sized model successfully.

This is just a temporary work around, when I get the chance I'll update the script so the audio_shape is dynamic to the models embedding dim. Let me know if you still have any issues.

mrfragger · 2023-07-10T22:10:39Z

mrfragger
Jul 10, 2023

Thanks flexchar for uploading those generated core ML models. It seems to be running faster than 3x. I hope at least..gotta wait till 30+ hour audiobook finishes to be sure. I'm using the medium.en model and it took a little over 16 hours to run it the first time. Now it's processing a batch of audiobooks so we'll see just how fast it actually is. Before was gettting 2x - 3x on Mac M1 8GB RAM. I doubt I have enough RAM for large model though and that's ok. Since this is using 82% RAM currently. Small model wasn't accurate enough for me for locations, cities, etc.

1 reply

vadi2 Aug 1, 2023

What result did you have in the end?

Run encoder on Apple Neural Engine #548

Replies: 10 comments · 62 replies

ggerganov Mar 1, 2023 Maintainer

wangchou Mar 6, 2023 Author

ggerganov Mar 6, 2023 Maintainer

wangchou Mar 6, 2023 Author

Speeding up the current encoder

Decoder & Kvcaching

Quantization

ggerganov Jul 11, 2023 Maintainer

wangchou Jul 12, 2023 Author

wangchou Jul 14, 2023 Author

ggerganov Jul 25, 2023 Maintainer

Replies: 10 comments 62 replies

ggerganov
Mar 1, 2023
Maintainer

wangchou Mar 6, 2023
Author

ggerganov Mar 6, 2023
Maintainer

wangchou Mar 6, 2023
Author

ggerganov Jul 11, 2023
Maintainer

wangchou Jul 12, 2023
Author

wangchou Jul 14, 2023
Author

ggerganov Jul 25, 2023
Maintainer