Faster batch size = 1 inference #77
Could you please clarify what exactly needs to work faster? If you have an ONNX graph, or a Julia graph that can be exported to ONNX, then the simplest way to execute it faster seems to be to run it with ONNXRunTime. Is there a reason that won't work?
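For reference, running an exported graph with ONNXRunTime.jl is only a few lines. A minimal sketch, where the model file name, the input name `"input"`, and the input shape are placeholders for illustration:

```julia
import ONNXRunTime as ORT

# Load the model into a CPU inference session.
model = ORT.load_inference("model.onnx")

# Inputs are passed as a Dict keyed by the graph's input names;
# the file name, input name, and shape here are placeholders.
x = randn(Float32, 1, 28, 28, 1)
out = model(Dict("input" => x))
```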
It's designed to run on big batches; look at the slides.
Sorry, I still don't understand your proposal/request. This package is about conversion between Julia functions (mostly NNlib/Flux) and the ONNX format. If you want an ONNX graph to run faster for single-item batches, then it might be better to post this issue in ONNXRuntime's repo (or the repo of some other ONNX engine). If you want it to run faster on the Julia side, i.e. faster execution of the converted graph within ONNX.jl itself, then this is the right place to discuss it.
Yes, I want it faster. For example, can we pre-allocate memory?
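As a rough illustration of what pre-allocation could look like (not something ONNX.jl does today, as far as I know): NNlib exposes in-place kernels such as `conv!`, so a runner could allocate output buffers once and reuse them across calls. The shapes below are made up for the example.

```julia
using NNlib

# One-off setup: pick shapes and allocate the conv output buffer once.
x = randn(Float32, 28, 28, 1, 1)   # WHCN input, batch size 1
w = randn(Float32, 3, 3, 1, 16)    # 3x3 kernel, 1 -> 16 channels
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(x, 4))

# Per-inference: write into the pre-allocated buffer instead of allocating a new one.
NNlib.conv!(y, x, w, cdims)
```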
Reading through the documentation on SOFIE, it seems like they generate custom C++ code for every model. ONNX.jl used to do this (with Julia code), but it's really cumbersome and difficult to integrate into a normal workflow. I doubt the overhead from having to "interpret" ops instead of running source code is that high though, doubly so with the tape-based approach this package uses.

Personally, I would be interested in where some of the bottlenecks are for inference with Julia DL libraries. ONNX.jl uses a number of functions from NNlib, so improvements would likely be made there and would then help out other packages as well.
Indeed, I don't know why they re-invent stuff like that; the NN primitives are just what they are. I can only guess (uneducatedly) that the overhead is mainly the tape playing (and that the allocation model of this pipeline is optimized for bigger batches?).
Many of the default kernels in NNlib are not terribly optimized and could definitely be improved upon. NNPack used to provide optimized versions for a couple of them, but that was dropped at some point for correctness reasons, AIUI. There was also an attempt to write optimized kernels in pure Julia, but I believe that stalled due to lack of time and not having a way to work around LoopVectorization.jl's latency penalty.
I thought the non-CUDA part of NNlib was already in Julia, but I guess I'm wrong, huh.
Compiling a tape is the easy part; you can already compile it into native Julia code. The hard part is what optimizations you want to apply. Some use cases will benefit from graph structure optimization (which ONNXRuntime is remarkably good at, by the way). Others can be accelerated by kernel fusion. Yet others need buffers and in-place operations. Perhaps the best way to attack this issue is to collect a set of real-life use cases and start experimenting.

(As a side note, I recently switched this package from Ghost to Umlaut due to a weird dependency compatibility issue. It is a drop-in replacement though, so you can safely ignore the difference.)
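To make the tracing/compiling point concrete, here is a sketch of the record-and-replay workflow. The `play!`/`compile` calls follow the Ghost.jl API this package used before the switch; Umlaut is described above as a drop-in replacement, but the exact names and calling convention should be checked against the current docs.

```julia
import Umlaut: trace, play!, compile

f(x) = sum(tanh.(2 .* x) .+ 1)

# Record the operations of f onto a tape.
val, tape = trace(f, rand(Float32, 16))

# Replay the recorded tape on fresh inputs
# (the traced function itself is the tape's first input).
play!(tape, f, rand(Float32, 16))

# Compile the tape into a plain Julia function with the same calling convention.
fast_f = compile(tape)
fast_f(f, rand(Float32, 16))
```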
It is, but not all of the algorithms used are optimal. For example, the default conv algorithm (im2col) isn't very memory-efficient, and pooling runs single-threaded.
More slides for future reference. I can do a benchmark later to compare ONNX.jl to ONNXRuntime, but the slides claim to be faster than ONNXRuntime, so...
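A possible skeleton for that benchmark, as a sketch only: the model file, input name, and input shape are placeholders, and the `ONNX.load`/`play!` calls follow my reading of the package README, so they may need adjusting.

```julia
using BenchmarkTools
import ONNX, Umlaut
import ONNXRunTime as ORT

path = "model.onnx"                  # placeholder model file
x = randn(Float32, 28, 28, 1, 1)     # placeholder single-item batch

# ONNX.jl: load the graph into a tape once, then replay it per inference.
tape = ONNX.load(path, x)
@btime Umlaut.play!($tape, $x)

# ONNXRuntime: create an inference session once, then call it per inference.
sess = ORT.load_inference(path)
@btime $sess(Dict("input" => $x))    # "input" is a placeholder input name
```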
There are certain applications that need to do one inference at a time, for example when analyzing large data:
https://indico.bnl.gov/event/15089/contributions/68235/attachments/43511/73312/Moneta-ROOT-FutureAnalysis.pdf#page=25
How can we make it faster and less allocating? I'd be happy to work on it.