[unlimited:waifu2x] Multithreading is possible but not configured properly #34

LoganDark · 2023-05-05T04:02:51Z

Problem

ONNX runtime supports multithreaded model execution, and it will automatically be enabled.

However, that can only happen when SharedArrayBuffer is available, which requires these HTTP headers to be set:

Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

https://unlimited.waifu2x.net does not send these headers, so ONNX runtime cannot use multiple threads. I will perform an experiment to show that this is a mistake.
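Whether a page currently meets these conditions can be checked from the console. The following is a minimal sketch, not code from unlimited:waifu2x: `threadingSupport` is a hypothetical helper name, and it only reads the standard `SharedArrayBuffer` and `crossOriginIsolated` globals from whatever scope object it is given.

```javascript
// Hypothetical helper (not part of unlimited:waifu2x or ONNX runtime):
// reports the two conditions discussed above for a given global scope.
function threadingSupport(scope = globalThis) {
  return {
    // Browsers hide SharedArrayBuffer unless the page is allowed to use it.
    sharedArrayBuffer: typeof scope.SharedArrayBuffer === "function",
    // Browsers set this to true when the COEP/COOP headers are in effect.
    crossOriginIsolated: scope.crossOriginIsolated === true,
  };
}

console.log(threadingSupport());
```

If `sharedArrayBuffer` is false in the browser, multithreaded execution cannot be expected to turn on.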

Experiment

I will add these headers for testing by using a Chrome extension.

These headers will make SharedArrayBuffer available, and ONNX runtime will automatically use multiple threads.

Parameters for the experiment

Model: swin_unet.art_scan
Denoise: 3 (highest)
Scale: 1 (1x)
Tile size: 256 (console: tile size = 256)
TTA level: 0 (disabled)
Detect alpha: false (no alpha channel)
Size of the image: 42 tiles

Performed using the version of unlimited:waifu2x that is currently live at https://unlimited.waifu2x.net.

Result of the experiment

Chromium

1 main thread that performs the execution (no changes):
388556.5769042969 ms (approx. 9251.347069149926 ms per tile)

12 worker threads that perform the execution (with headers):
143964.38818359375 ms (approx. 3427.723528180805 ms per tile)

Using 12 threads divides the time taken by 2.698977030408252, a 2.7x improvement.

Firefox

1 main thread that performs the execution (no changes):
~~DNF (slow); 109955ms for 3 tiles; estimated 1539370ms for 42 tiles (approx. 36651ms per tile)~~
169983ms (approx. 4047.214285714286ms per tile)

12 worker threads that perform the execution (enabled dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled):
~~DNF (slow); 147402ms for 18 tiles; estimated 343938ms for 42 tiles (approx. 8189ms per tile)~~
58643ms (approx. 1396.261904761905ms per tile)

Using 12 threads divides the time taken by ~~4.475719461065657~~ 2.898606824343911, a 2.9x improvement, even larger than Chromium.
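The per-tile times and speedup factors above can be reproduced from the raw totals. This is only an arithmetic sketch using the numbers reported in this issue; the names `runs` and `summarize` are made up for illustration.

```javascript
// Total render times in milliseconds, as reported above, for 42 tiles.
const TILES = 42;
const runs = {
  chromium: { singleMs: 388556.5769042969, threadedMs: 143964.38818359375 },
  firefox: { singleMs: 169983, threadedMs: 58643 },
};

// Per-tile time is total / tiles; speedup is single-threaded / multithreaded.
function summarize({ singleMs, threadedMs }, tiles = TILES) {
  return {
    singlePerTileMs: singleMs / tiles,
    threadedPerTileMs: threadedMs / tiles,
    speedup: singleMs / threadedMs,
  };
}

for (const [browser, run] of Object.entries(runs)) {
  const s = summarize(run);
  console.log(
    `${browser}: ${s.singlePerTileMs.toFixed(1)} ms/tile -> ` +
      `${s.threadedPerTileMs.toFixed(1)} ms/tile (${s.speedup.toFixed(2)}x)`
  );
}
```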

Implementation steps

Instruct the server to send the required HTTP headers
Set ort.env.wasm.numThreads = navigator.hardwareConcurrency before initialization; otherwise it defaults to at most 4 threads
Enjoy the free speedup
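The first two steps could look roughly like the sketch below. It is an illustration under assumptions, not the site's actual code: `withIsolationHeaders` and `configureThreads` are hypothetical helpers, and `fakeEnv` is a stand-in for the real `ort.env` object so that the snippet runs without loading ONNX runtime.

```javascript
// The two response headers named in this issue.
const ISOLATION_HEADERS = {
  "Cross-Origin-Embedder-Policy": "require-corp",
  "Cross-Origin-Opener-Policy": "same-origin",
};

// Hypothetical helper: merge the isolation headers into the headers
// a server already sends for the page.
function withIsolationHeaders(headers = {}) {
  return { ...headers, ...ISOLATION_HEADERS };
}

// Hypothetical helper: set the thread count on an `ort.env`-shaped object.
// Must run before the first inference session is created.
function configureThreads(ortEnv, cores) {
  ortEnv.wasm.numThreads = Math.max(1, cores || 1);
  return ortEnv.wasm.numThreads;
}

// Stand-in for `ort.env` (assumption: the real object has a `wasm` field).
const fakeEnv = { wasm: {} };
configureThreads(fakeEnv, globalThis.navigator?.hardwareConcurrency);
console.log(withIsolationHeaders({ "Content-Type": "text/html" }), fakeEnv);
```

In the real page, `ort.env` would be passed instead of `fakeEnv`, and the headers would be configured in the web server rather than in script.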


LoganDark · 2023-05-05T04:44:54Z

It's astonishing how fast unlimited:waifu2x can get in Firefox with 12 threads. Seems Firefox really is the best at WebAssembly JIT.

It's possible to make it even faster by making the models compatible with ONNX runtime's WebGL or WebGPU backends, so that they can be executed on the GPU, just like with CUDA. In fact, the WebGPU backend might already be compatible (but I have not looked into this yet)

Compatibility mostly consists of removing operators that WebGL doesn't support, like ConstantOfShape. Some optimizers can already recognize and remove these. utils/pad.onnx pictured below, official on left, optimized on right:

But you also have to adjust int64 values so that they fit in 32 bits (utils/alpha_border_padding.onnx pictured below):

This can be done manually in a python debugger (as I have successfully done for some models). And not all the actual int64 types have to be converted to int32 (although the int64 casts need to be removed), they just need to fit in an int32.
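The idea of keeping the int64 type while making the values fit in 32 bits can be illustrated on a plain array. This sketch does not touch an ONNX model; `fitsInInt32` and `clampToInt32Range` are hypothetical names, and clamping is an assumption that only makes sense when the oversized values are sentinels (for example a "to the end" index) rather than meaningful numbers.

```javascript
const INT32_MIN = -(2n ** 31n);
const INT32_MAX = 2n ** 31n - 1n;

// True when a 64-bit value is representable as a signed 32-bit integer.
function fitsInInt32(value) {
  return value >= INT32_MIN && value <= INT32_MAX;
}

// Returns a new BigInt64Array (the element type stays int64) whose values
// are clamped into the int32 range. Assumption: clamping is acceptable for
// the tensor in question; this is not checked here.
function clampToInt32Range(values) {
  return BigInt64Array.from(values, (v) =>
    v > INT32_MAX ? INT32_MAX : v < INT32_MIN ? INT32_MIN : v
  );
}

const example = BigInt64Array.of(0n, 9223372036854775807n, -1n);
console.log(Array.from(example, fitsInInt32));
console.log(clampToInt32Range(example));
```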

I have successfully gotten some of the utility models to load in ONNX runtime's WebGL backend, but unfortunately, this isn't very useful because the utility models are mostly precalculations, and the most expensive part is the actual upscaling model, which uses operators that WebGL doesn't support, namely ConstantOfShape, Where, Expand.

I'm also looking into seeing if I can actually add support for ConstantOfShape into the WebGL backend myself, but of course this is not very easy since I cannot build ONNX runtime from source yet. Maybe I will modify the minified JS (hehehe....). My personal version of unlimited:waifu2x is based on a TypeScript translation/rewrite of reverse engineered minified code.

nagadomi · 2023-05-05T05:36:17Z

Thank you for sharing.

Multithreading
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

I tried this before but did not apply it, as it was slower than the original code on Chrome.
(I first tried ort.env.wasm.numThreads=4 but it didn't seem to work, so I tried microsoft/onnxruntime#9681 )
I may need to try again.

WebGL

I gave up on using WebGL backend because of the many unsupported functions.
(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

WebGPU

I tried it recently but it did not work yet. microsoft/onnxruntime#15796

As of now, I am hoping to get WebGPU backend to work.
So I think that WebGL backend does not have to work.
It would be nice to have WebAssembly backend faster for users who don't have a GPU.

LoganDark · 2023-05-05T05:40:52Z

I tried this before but did not apply it, as it was slower than the original code on Chrome.

This is clearly not true anymore

(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

I also tried modifying that script. But int32 conversion is not required, only reducing magnitude of the values. And full conversion causes the model to fail validation anyway, because some operators require int64 attributes.

I tried it recently but it did not work yet. microsoft/onnxruntime#15796

Good to know~

As of now, I am hoping to get WebGPU backend to work.
So I think that WebGL backend does not have to work.

You're right, it doesn't. This issue itself is about WASM multithreading, not WebGL (that was just a slightly related comment).

It would be nice to have WebAssembly backend faster for users who don't have a GPU.

Absolutely

nagadomi · 2023-05-05T05:54:47Z

Also, the PyTorch version (cli/server and training) runs in 16-bit float (half float).
If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

LoganDark · 2023-05-05T06:00:43Z

If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

It should be sufficient to convert the input tensor to float16 and back each time you run the model. You can probably use these converter functions and use Uint16Array tensors as float16. Then use a model that expects float16. I will probably perform my own experiments once my codebase is functional
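A conversion of this kind could be sketched as follows. These are not the converter functions referred to above; `float32ToFloat16Bits` and `float16BitsToFloat32` are illustrative names for a standard IEEE 754 half-precision conversion with round-to-nearest-even, storing the 16-bit patterns in a `Uint16Array`.

```javascript
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Convert a number to the bit pattern of the nearest float16 value.
function float32ToFloat16Bits(value) {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  let mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15; // re-bias the exponent for float16
  if (e >= 0x1f) return sign | 0x7c00; // too large: becomes Infinity
  if (e <= 0) {
    if (e < -10) return sign; // too small: becomes signed zero
    mant |= 0x800000; // restore the implicit leading 1
    const shift = 14 - e;
    let sub = mant >>> shift;
    const rem = mant & ((1 << shift) - 1);
    const halfway = 1 << (shift - 1);
    if (rem > halfway || (rem === halfway && (sub & 1))) sub++;
    return sign | sub; // subnormal float16
  }
  let half = (e << 10) | (mant >>> 13);
  const rem = mant & 0x1fff;
  if (rem > 0x1000 || (rem === 0x1000 && (half & 1))) half++; // ties to even
  return sign | half;
}

// Convert a float16 bit pattern back to a JavaScript number.
function float16BitsToFloat32(h) {
  const sign = h & 0x8000 ? -1 : 1;
  const exp = (h >>> 10) & 0x1f;
  const mant = h & 0x3ff;
  if (exp === 0) return sign * mant * 2 ** -24; // zero or subnormal
  if (exp === 0x1f) return mant ? NaN : sign * Infinity;
  return sign * (1 + mant / 1024) * 2 ** (exp - 15);
}

// Pack a Float32Array into a Uint16Array of float16 bit patterns and back.
const packed = Uint16Array.from(new Float32Array([0, 0.5, 1, -2]), float32ToFloat16Bits);
console.log(packed, Float32Array.from(packed, float16BitsToFloat32));
```

Values that are not exactly representable in float16 are rounded, so a round trip is lossy in general.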

nagadomi · 2023-05-05T08:44:29Z

OK, I have confirmed that it is faster with multithreading.

// google-chrome
// default (test on unlimited.waifu2x.net)
tile size = 256
script.js:38 render: 38275.5 ms
tile size = 256
script.js:38 render: 38714.466064453125 ms

// numThreads=16 (test on localhost)
tile size = 256
script.js:489 render: 12700.81005859375 ms
tile size = 256
script.js:489 render: 12487.656005859375 ms

// Firefox
// default
tile size = 256 script.js:28:27
render: 35854ms - timer ended
tile size = 256 script.js:28:27
render: 35756ms - timer ended

// numThreads=16
tile size = 256 script.js:362:17
render: 12039.94ms - timer ended
tile size = 256 script.js:362:17
render: 11316.12ms - timer ended

I may have made that mistake before, as it gets very slow when DevTools is open.

One thing that is not great is that all javascript files must be hosted locally to enable SharedArrayBuffer.

LoganDark · 2023-05-05T08:45:22Z

all javascript files must be hosted locally

You mean vendored (on your server that serves the correct HTTP headers)? You should have been doing that anyway. You should not depend on CDNs for your website's main functionality. You host the models on your server so why not host the runtime to execute them?

nagadomi · 2023-05-05T08:48:08Z

Ahh, I remember. One of the reasons I didn't use it is because it would not work with Google Analytics or Adsense.

LoganDark · 2023-05-05T08:48:54Z

Ahh, I remember. One of the reasons I didn't use it is because it would not work with Google Analytics or Adsense.

Why not? Can't you vendor those scripts as well?

nagadomi · 2023-05-05T08:54:07Z

It needs to load scripts from third-party servers.
related to https://stackoverflow.com/questions/68683903/is-there-a-way-to-use-google-adsense-with-cross-origin-isolation

LoganDark · 2023-05-05T08:58:05Z

If you are ok with only being compatible with Chrome 96 and higher, setting Cross-Origin-Embedder-Policy: credentialless should work to keep google ads functional.

https://chromestatus.com/feature/4918234241302528

The header works on my chrome and SharedArrayBuffer exists with it.

But this does not enable multithreading in Firefox (Firefox does not support credentialless).

Also try adding the crossorigin attribute to the script tag, it probably won't work but is worth a try maybe.


nagadomi · 2023-05-05T13:12:30Z

For now, I have not been able to get AdSense to work in a cross-origin isolated environment.
I registered the website for the Chrome Origin Trial (SharedArrayBuffer), and it works on Chrome.

LoganDark · 2023-05-07T16:53:02Z

Is there any way to get firefox support as well?

- Inference is defined as an async method, but it blocks. After a couple of days of trying all avenues and looking at sample apps, it looks like it is synchronous in that it will consume the attention of the thread the `await session.run` is called on.
- Using Squadron to handle multi-threading didn't work. Now that the JS function in index.html is loading the model and passing it to a worker, it's possible it might.
- In any case, this shows exactly how to set up a worker that A) does inference without blocking UI rendering and B) allows Dart code to `await` the result without blocking the UI.
- This process was frustrating and fraught; there's a surprising lack of info and examples around ONNX web. Most seem to consume it via diffusers.js/transformers.js. ONNX web was a separate library from the rest of the ONNX runtime until sometime around late 2022. The examples still use that library, and the examples use simple enough models that it's hard to catch whether they are blocking without falling back to dev tools.
- It's absolutely crucial when debugging speed locally to make sure you're loading the ONNX version you expect (i.e. wasm AND threaded AND simd). The easiest way to check is network loads in Dev Tools: sort by size, and look for the .wasm file to A) be loaded and B) include wasm, simd, and threaded in the filename.
- Two things can prevent that:
  - CORS nonsense with Flutter serving itself in debug mode:
    - See here: nagadomi/nunif#34
    - Note that the extension became adware; you should have Chrome set up its permissions such that it isn't run until you click it. Also, note that you have to do that each time the port of the Flutter web app in debug mode changes.
  - MIME type issues:
    - Even after that, I would see errors in console logs about the MIME type of the .wasm being incorrect and starting with the wrong bytes. That, again, seems due to local Flutter serving of the web app.
    - To work around that, you can download the WASM files from the same CDN folder that hosts ort.min.js (see worker.js) and also, in worker.js, remove the // in front of ort.env.wasm.wasmPaths = "". That indicates you've placed the WASM files next to index.html, which you should. Note you just need the 4 .wasm files, no more, from the CDN.

Some performance review notes:

- `webgpu` as execution provider completely errors out, says "JS executor not supported in the ONNX version" (1.16.3)
- `webgl` throws "Cannot read properties of null (reading 'irVersion')"
- Tested perf by varying wasm / simd / thread and thread count on M2 MacBook Air 16 GB ram, Chrome 120
- Landed on simd & thread count = 1/2 of cores as best performing
  - first # is minilm l6v2, second is minilm l6v3, average inference time for 200 / 400 words
  - 4 threads: 526 ms / 2196 ms
  - simd 4 threads: 86 ms / 214 ms
  - simd 8 threads: 106 ms / 260 ms
  - simd 128 threads: 2879 ms / skipped
  - simd navigator.hardwareConcurrency threads (8): 107 ms / 222 ms

nagadomi added a commit that referenced this issue May 5, 2023

waifu2x: unlimited: Support for wasm multi-threading (Not enabled by …

149f17a

…default) related to #34
