Faster batch size = 1 inference #77

Open
Moelf opened this issue Sep 30, 2022 · 12 comments

Moelf (Contributor) commented Sep 30, 2022

There are certain applications that need to do one inference at a time, for example when analyzing large datasets:
https://indico.bnl.gov/event/15089/contributions/68235/attachments/43511/73312/Moneta-ROOT-FutureAnalysis.pdf#page=25

How can we make this faster and less allocation-heavy? I'd be happy to work on it.

dfdx (Collaborator) commented Oct 1, 2022

Could you please clarify what exactly needs to work faster? If you have an ONNX graph, or a Julia graph that can be exported to ONNX, then the simplest way to execute it faster seems to be to run it with ONNXRunTime. Is there a reason that won't work?
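
For reference, a minimal sketch of the ONNXRunTime.jl path; the file name `model.onnx`, the input name `"input"`, and the input shape are placeholders that depend on the exported graph:

```julia
using ONNXRunTime

# Load the model into a CPU inference session.
model = ONNXRunTime.load_inference("model.onnx")

# A single-item batch, shaped the way the graph expects (placeholder shape).
x = rand(Float32, 1, 1, 28, 28)

# Inputs are passed as a Dict keyed by the graph's input names;
# the result is a Dict of output arrays.
y = model(Dict("input" => x))
```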

Moelf (Contributor, Author) commented Oct 1, 2022

It's designed to run on big batches; have a look at the slides.

dfdx (Collaborator) commented Oct 1, 2022

Sorry, I still don't understand your proposal/request. This package is about conversion between Julia functions (mostly NNlib/Flux) and ONNX format. If you want an ONNX graph to run faster for single-item batches, then it might be better to post this issue at ONNXRuntime's repo (or repo of some other ONNX engine). If you want it to run faster on the Julia side, i.e. faster Umlaut.play!(tape), then we need at least some use cases to measure the performance and understand bottlenecks.
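
For context, a rough sketch of the Julia-side path being discussed; `model.onnx` and the input shape are placeholders, and it assumes `ONNX.load` traces the graph into an Umlaut tape:

```julia
using ONNX, Umlaut

x = rand(Float32, 28, 28, 1, 1)      # batch size = 1, placeholder shape
tape = ONNX.load("model.onnx", x)    # trace the ONNX graph into a tape

# This is the call whose per-invocation cost matters for single-item inference.
y = Umlaut.play!(tape, x)
```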

Moelf (Contributor, Author) commented Oct 2, 2022

Yes, I want faster Ghost.play!() for batch size = 1.

For example, can we pre-allocate memory?
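
To make the pre-allocation question concrete, a sketch of how the per-call allocations could be measured for a single-item batch (model path and shape are placeholders):

```julia
using ONNX, Umlaut, BenchmarkTools

x = rand(Float32, 28, 28, 1, 1)      # batch size = 1, placeholder shape
tape = ONNX.load("model.onnx", x)    # placeholder model

# Every replay currently allocates fresh output buffers for each op.
@btime Umlaut.play!($tape, $x)

# A pre-allocation scheme would need in-place variants of the kernels,
# e.g. NNlib.conv!(y, x, w, cdims) writing into a cached buffer y.
```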

Moelf (Contributor, Author) commented Oct 2, 2022

> ONNXRuntime's repo

The slides explicitly show them beating ONNXRuntime:

[benchmark image from the slides]

But yes, I understand that "we don't have much motivation to make Julia faster than ONNXRuntime" is a valid response; if so, I can close the issue.

ToucheSir (Member) commented

Reading through the documentation on SOFIE, it seems like they generate custom C++ code for every model. ONNX.jl used to do this (with Julia code), but it's really cumbersome and difficult to integrate into a normal workflow. I doubt the overhead from having to "interpret" ops instead of running source code is that high though, doubly so with the tape-based approach this package uses.

Personally, I would be interested in where some of the bottlenecks are for inference with Julia DL libraries. ONNX.jl uses a number of functions from NNlib, so improvements would likely be made there and then help out other packages as well.

Moelf (Contributor, Author) commented Oct 2, 2022

> it seems like they generate custom C++ code for every model.

Indeed, I don't know why they reinvent things like that; the NN primitives are what they are. I can only guess (without much evidence) that the overhead is mainly in playing the tape (and that the allocation pattern of this pipeline is optimized for bigger batches?).

ToucheSir (Member) commented

Many of the default kernels in NNlib are not terribly optimized and could definitely be improved upon. NNPack used to provide optimized versions for a couple of them, but that was dropped at some point for correctness reasons, as I understand it. There was also an attempt to write optimized kernels in pure Julia, but I believe that stalled due to lack of time and not having a way to work around LoopVectorization.jl's latency penalty.

Moelf (Contributor, Author) commented Oct 2, 2022

I thought the non-CUDA part of NNlib was already pure Julia, but I guess I'm wrong.

dfdx (Collaborator) commented Oct 2, 2022

Compiling a tape is the easy part. You can already compile it into native Julia code using Umlaut.compile(tape), or you can generate custom Julia, C, CUDA, or any other code.
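
For illustration, a minimal sketch of that compile path; the model path and input are placeholders, and the calling convention of the compiled function is assumed to mirror the tape's inputs:

```julia
using ONNX, Umlaut

x = rand(Float32, 28, 28, 1, 1)      # placeholder input
tape = ONNX.load("model.onnx", x)    # placeholder model

# Compile the tape once into a plain Julia function, then call it repeatedly.
f = Umlaut.compile(tape)
y = f(x)                             # assumed to take the same inputs as the tape
```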

The hard part is deciding what optimizations to apply. Some use cases will benefit from graph structure optimization (which ONNXRuntime is remarkably good at, by the way). Others can be accelerated by kernel fusion. Yet others need buffers and in-place operations. Perhaps the best way to attack this issue is to collect a set of real-life use cases and start experimenting.

(As a side note, I recently switched this package from Ghost to Umlaut due to a weird dependency compatibility issue. Umlaut is a drop-in replacement though, so you can safely ignore the difference.)

ToucheSir (Member) commented

> I thought the non-CUDA part of NNlib was already pure Julia

It is, but not all of the algorithms used are optimal. For example, the default conv algorithm (im2col) isn't very memory-efficient, and pooling runs single-threaded.
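
As a rough illustration of that cost (shapes are placeholders):

```julia
using NNlib

x = rand(Float32, 224, 224, 3, 1)   # WHCN layout, batch size = 1
w = rand(Float32, 3, 3, 3, 16)      # 3x3 kernel, 3 -> 16 channels

# The default im2col-based conv materializes a large intermediate buffer
# in addition to the output; @time reports it as allocations.
@time y = NNlib.conv(x, w)
```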

Moelf (Contributor, Author) commented Nov 11, 2022

https://indico.cern.ch/event/1176076/contributions/4939648/attachments/2474114/4245117/SOFIE%40ICHEP.pdf#page=11

More slides for future reference. I can do a benchmark later comparing ONNX.jl to ONNXRuntime, but the slides claim to be faster than ONNXRuntime, so...
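
A possible skeleton for that benchmark; the model path, input name, and shapes are placeholders, and note that the two packages may expect different dimension orderings for the same graph input:

```julia
using ONNX, Umlaut, ONNXRunTime, BenchmarkTools

path = "model.onnx"                      # placeholder model
x = rand(Float32, 28, 28, 1, 1)          # batch size = 1, placeholder shape

# ONNX.jl: trace once, then replay the tape many times.
tape = ONNX.load(path, x)
@btime Umlaut.play!($tape, $x)

# ONNXRunTime.jl: load once, then run the session many times.
sess = ONNXRunTime.load_inference(path)
@btime $sess(Dict("input" => $x))        # "input" is a placeholder input name
```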
