Faster batch size = 1 inference #77

Open
Moelf opened this issue Sep 30, 2022 · 12 comments

Moelf (Contributor) commented Sep 30, 2022

There are certain applications that need to do one inference at a time, for example when analyzing large datasets:
https://indico.bnl.gov/event/15089/contributions/68235/attachments/43511/73312/Moneta-ROOT-FutureAnalysis.pdf#page=25

How can we make this faster and less allocation-heavy? I'd be happy to work on it.

dfdx (Collaborator) commented Oct 1, 2022

Could you please clarify what exactly needs to work faster? If you have an ONNX graph, or a Julia graph that can be exported to ONNX, then the simplest way to execute it faster seems to be to run it with ONNXRunTime. Is there a reason that won't work?
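
For reference, a minimal sketch of the ONNXRunTime.jl path; the file name `model.onnx`, the input name `"input"`, and the input shape are placeholders that depend on the exported graph:

```julia
using ONNXRunTime

# Load the model into a CPU inference session.
model = ONNXRunTime.load_inference("model.onnx")

# A single-item batch, shaped the way the graph expects (placeholder shape).
x = rand(Float32, 1, 1, 28, 28)

# Inputs are passed as a Dict keyed by the graph's input names;
# the result is a Dict of output arrays.
y = model(Dict("input" => x))
```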

Moelf (Contributor, Author) commented Oct 1, 2022

It's designed to run on big batches; have a look at the slides.

dfdx (Collaborator) commented Oct 1, 2022

Sorry, I still don't understand your proposal/request. This package is about conversion between Julia functions (mostly NNlib/Flux) and ONNX format. If you want an ONNX graph to run faster for single-item batches, then it might be better to post this issue at ONNXRuntime's repo (or repo of some other ONNX engine). If you want it to run faster on the Julia side, i.e. faster Umlaut.play!(tape), then we need at least some use cases to measure the performance and understand bottlenecks.
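
For context, a rough sketch of the Julia-side path being discussed; `model.onnx` and the input shape are placeholders, and it assumes `ONNX.load` traces the graph into an Umlaut tape:

```julia
using ONNX, Umlaut

x = rand(Float32, 28, 28, 1, 1)      # batch size = 1, placeholder shape
tape = ONNX.load("model.onnx", x)    # trace the ONNX graph into a tape

# This is the call whose per-invocation cost matters for single-item inference.
y = Umlaut.play!(tape, x)
```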

Moelf (Contributor, Author) commented Oct 2, 2022

Yes, I want faster Ghost.play!() for batch size = 1.

For example, can we pre-allocate memory?
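
To make the pre-allocation question concrete, a sketch of how the per-call allocations could be measured for a single-item batch (model path and shape are placeholders):

```julia
using ONNX, Umlaut, BenchmarkTools

x = rand(Float32, 28, 28, 1, 1)      # batch size = 1, placeholder shape
tape = ONNX.load("model.onnx", x)    # placeholder model

# Every replay currently allocates fresh output buffers for each op.
@btime Umlaut.play!($tape, $x)

# A pre-allocation scheme would need in-place variants of the kernels,
# e.g. NNlib.conv!(y, x, w, cdims) writing into a cached buffer y.
```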

Moelf (Contributor, Author) commented Oct 2, 2022

> ONNXRuntime's repo

The slides explicitly show them beating ONNXRuntime:

[benchmark image from the slides]

But yes, I understand that "we don't have much motivation to make Julia faster than ONNXRuntime" is a valid response; if so, I can close the issue.

ToucheSir (Member) commented

Reading through the documentation on SOFIE, it seems like they generate custom C++ code for every model. ONNX.jl used to do this (with Julia code), but it's really cumbersome and difficult to integrate into a normal workflow. I doubt the overhead from having to "interpret" ops instead of running source code is that high though, doubly so with the tape-based approach this package uses.

Personally, I would be interested in where some of the bottlenecks are for inference with Julia DL libraries. ONNX.jl uses a number of functions from NNlib, so improvements would likely be made there and then help out other packages as well.

Moelf (Contributor, Author) commented Oct 2, 2022

> it seems like they generate custom C++ code for every model.

Indeed, I don't know why they reinvent things like that; the NN primitives are what they are. I can only guess (without much evidence) that the overhead is mainly in playing the tape (and that the allocation pattern of this pipeline is optimized for bigger batches?).

ToucheSir (Member) commented

Many of the default kernels in NNlib are not terribly optimized and could definitely be improved upon. NNPack used to provide optimized versions for a couple of them, but that was dropped at some point for correctness reasons, as I understand it. There was also an attempt to write optimized kernels in pure Julia, but I believe that stalled due to lack of time and not having a way to work around LoopVectorization.jl's latency penalty.

Moelf (Contributor, Author) commented Oct 2, 2022

I thought the non-CUDA part of NNlib was already pure Julia, but I guess I'm wrong.

dfdx (Collaborator) commented Oct 2, 2022

Compiling a tape is the easy part. You can already compile it into native Julia code using Umlaut.compile(tape), or you can generate custom Julia, C, CUDA, or any other code.
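
For illustration, a minimal sketch of that compile path; the model path and input are placeholders, and the calling convention of the compiled function is assumed to mirror the tape's inputs:

```julia
using ONNX, Umlaut

x = rand(Float32, 28, 28, 1, 1)      # placeholder input
tape = ONNX.load("model.onnx", x)    # placeholder model

# Compile the tape once into a plain Julia function, then call it repeatedly.
f = Umlaut.compile(tape)
y = f(x)                             # assumed to take the same inputs as the tape
```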

The hard part is deciding what optimizations to apply. Some use cases will benefit from graph structure optimization (which ONNXRuntime is remarkably good at, by the way). Others can be accelerated by kernel fusion. Yet others need buffers and in-place operations. Perhaps the best way to attack this issue is to collect a set of real-life use cases and start experimenting.

(As a side note, I recently switched this package from Ghost to Umlaut due to a weird dependency compatibility issue. Umlaut is a drop-in replacement though, so you can safely ignore the difference.)

ToucheSir (Member) commented

> I thought the non-CUDA part of NNlib was already pure Julia

It is, but not all of the algorithms used are optimal. For example, the default conv algorithm (im2col) isn't very memory-efficient, and pooling runs single-threaded.
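
As a rough illustration of that cost (shapes are placeholders):

```julia
using NNlib

x = rand(Float32, 224, 224, 3, 1)   # WHCN layout, batch size = 1
w = rand(Float32, 3, 3, 3, 16)      # 3x3 kernel, 3 -> 16 channels

# The default im2col-based conv materializes a large intermediate buffer
# in addition to the output; @time reports it as allocations.
@time y = NNlib.conv(x, w)
```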

Moelf (Contributor, Author) commented Nov 11, 2022

https://indico.cern.ch/event/1176076/contributions/4939648/attachments/2474114/4245117/SOFIE%40ICHEP.pdf#page=11

More slides for future reference. I can do a benchmark later comparing ONNX.jl to ONNXRuntime, but the slides claim to be faster than ONNXRuntime, so...
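
A possible skeleton for that benchmark; the model path, input name, and shapes are placeholders, and note that the two packages may expect different dimension orderings for the same graph input:

```julia
using ONNX, Umlaut, ONNXRunTime, BenchmarkTools

path = "model.onnx"                      # placeholder model
x = rand(Float32, 28, 28, 1, 1)          # batch size = 1, placeholder shape

# ONNX.jl: trace once, then replay the tape many times.
tape = ONNX.load(path, x)
@btime Umlaut.play!($tape, $x)

# ONNXRunTime.jl: load once, then run the session many times.
sess = ONNXRunTime.load_inference(path)
@btime $sess(Dict("input" => $x))        # "input" is a placeholder input name
```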
