Unbundling oneDNN #183

Open
h-vetinari opened this issue Nov 25, 2021 · 14 comments

@h-vetinari
Member

Breaking out the discussion from #175:

@njzjz: Currently tensorflow here is not built against MKL (OneDNN), but do we plan to?

@xhochy: Tensorflow is building oneDNN from source in the build currently. That is probably the first 1-2h of CI. Getting that from the feedstock instead of building from source would help a lot.

I looked at how upstream builds this, and - annoyingly - it's a bit of a mess. There seem to be three different versions of onednn being pulled in, with one of them (0.21.3) being quite old and not available in conda-forge.

I'm happy to start a PR that tries to rip things out and see how far we get, but I just wanted to get an opinion (@xhochy?) on whether replacing the upstream versions is even realistic, given the different versions involved. Or maybe we just unbundle the most important one?
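
For what it's worth, the TensorFlow build already has a hook for swapping vendored third-party packages for system copies: the TF_SYSTEM_LIBS variable, backed by BUILD stubs under third_party/systemlibs. Whether onednn can simply be added to that list is exactly what a PR would have to find out; the package name "onednn" below and the existence of a matching stub are assumptions on my part, not something upstream necessarily supports today. A minimal sketch of what build.sh might gain:

# hypothetical: ask the build to use the system (conda-forge) oneDNN instead of
# the bundled sources; this only works if a third_party/systemlibs stub exists
export TF_SYSTEM_LIBS="${TF_SYSTEM_LIBS},onednn"

# the conda-forge onednn package would also need to be a host/run requirement
# of the feedstock so its headers and libraries are on the build paths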

@h-vetinari
Member Author

Here are the build files for the different versions (the versions manually set in the bazel scripts don't match the artefacts they're actually pulling in).

@ngam
Contributor

ngam commented Nov 27, 2021

Maybe I am a bit late or uninformed about this, but my take on the onednn stuff is that 1) it is kinda not worth it in my testing, and 2) it is actually really easy to turn on after the fact: you just need to declare the env variable TF_ENABLE_ONEDNN_OPTS, though I am not sure it is actually as easy as it seems. I have tended to add

  - onednn # optimization for intel onednn library
  - pip
  - pip:
    - tensorflow # pip install for 2.7.0 (cpu)
    - tensorflow-gan


variables:
  TF_ENABLE_ONEDNN_OPTS: 1 # enables onednn optimization

at the end of my env.yml file (for CPUs; I usually get tensorflow-gpu from conda for GPUs). Does the build need to be modified for this env variable to work? I remember it working fine with the conda-forge tensorflow, though I usually get the CPU version from pip (as above).

See the Intel info for reference here.

@ngam
Contributor

ngam commented Nov 27, 2021

Maybe a simple test can also be added to ensure this? I usually test by running this Python file... which is from their "sanity check", but if you look closely, it seems like a phony test. Anyway, the Intel optimization for tensorflow is very underwhelming. I really don't think Intel is serious about this stuff, and their efforts seem to be haphazard and super low-energy. Both their devcloud and the whole oneapi stuff are a total mess. (You can browse their github repos to see what they are up to...)

import os

import tensorflow as tf


def get_mkl_enabled_flag():
    """Report whether this TensorFlow build has MKL/oneDNN support enabled."""
    mkl_enabled = False
    onednn_enabled = 0
    major_version = int(tf.__version__.split(".")[0])
    minor_version = int(tf.__version__.split(".")[1])
    if major_version >= 2:
        # the helper module moved between TF 2.4 and 2.5
        if minor_version < 5:
            from tensorflow.python import _pywrap_util_port
        else:
            from tensorflow.python.util import _pywrap_util_port
            onednn_enabled = int(os.environ.get("TF_ENABLE_ONEDNN_OPTS", "0"))
        mkl_enabled = _pywrap_util_port.IsMklEnabled() or (onednn_enabled == 1)
    else:
        mkl_enabled = tf.pywrap_tensorflow.IsMklEnabled()
    return mkl_enabled

print("We are using Tensorflow version", tf.__version__)
print("MKL enabled:", get_mkl_enabled_flag())
print(tf.config.list_physical_devices())
print(tf.test.is_gpu_available())  # deprecated, but kept from the original snippet
print(tf.test.gpu_device_name())

@xhochy
Member

xhochy commented Nov 29, 2021

@ngam This issue is not about using/not using the oneDNN ops, but whether we build oneDNN as part of the tensorflow build or whether we can just link against the existing oneDNN package.

@ngam
Contributor

ngam commented Nov 29, 2021

@ngam This issue is not about using/not using the oneDNN ops, but whether we build oneDNN as part of the tensorflow build or whether we can just link against the existing oneDNN package.

Oooh… thanks for explaining. Well, the latter makes much more sense (linking against the existing one), especially in light of what I said about the actual utility of it. Then you can test if the linking works…

@ngam
Contributor

ngam commented Apr 28, 2022

I would like to revisit my statement above. I could finally see the advantage of the onednn optimization: it speeds up code by 50% for smaller loads, and by up to 400% for more sustained loads. Also, the AVX-512 targeting seems to play a key role; that alone can result in a significant speedup.

Interestingly, the tensorflow from conda-forge (2.7.0) prints out

2022-04-28 12:42:42.979985: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA

but usually tensorflow from PyPI or Intel prints out

2022-04-28 12:45:23.794234: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA

so... seemingly, we have better optimization here? This results in a bit of performance improvement (i.e. comparing v2.7.0 from conda-forge to PyPI or Intel). My current case is one with a lot of intermediate mapping (and dealing with very large files).

Edit: NGC containers from NVIDIA behave very similarly to conda-forge's tensorflow, including the small, but noticeable, speedup.
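
For anyone wanting to reproduce the comparison: that line comes from cpu_feature_guard and, as far as I can tell, is only printed once, when the first op runs (and only if INFO-level logging isn't suppressed). A quick way to trigger it on any install:

# run a single op so cpu_feature_guard reports which CPU features this build
# will use in performance-critical (oneDNN) operations
TF_CPP_MIN_LOG_LEVEL=0 python -c "import tensorflow as tf; tf.reduce_sum(tf.random.normal([1000, 1000]))"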

@h-vetinari
Member Author

Interestingly, the tensorflow from conda-forge (2.7.0) prints out

[...]:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA

Do we know how this is handled for older CPUs (e.g. without AVX512) - gracefully or do things crash?

Before we have something like conda-forge/conda-forge.github.io#1261, we shouldn't be compiling things that hard-require more than SSE4 (I think...). Or perhaps AVX can be the new baseline?

As an example, faiss (both upstream and in conda-forge) builds fat binaries that are compiled against both SSE4 and AVX2 and load the latter at runtime if possible, but that's not a scalable approach IMO. See conda-forge/faiss-split-feedstock#23

@ngam
Contributor

ngam commented Apr 29, 2022

Do we know how this is handled for older CPUs (e.g. without AVX512) - gracefully or do things crash?

Works perfectly fine!

Actually, not only does it work perfectly fine, but for some reason it is still a bit faster (even on machines without AVX-512) than the Intel and PyPI tensorflows.

@ngam
Contributor

ngam commented Apr 29, 2022

I don't know if this is mainly to do with my esoteric data pipeline (shuffling ~100s of GB of data, so CPU performance actually matters in my case even when running on GPUs). I did a few more tests to see if it is consistent across different machines (CPU-only, GPU-CPU, different architectures) and it held up.

@h-vetinari
Member Author

Works perfectly fine!

That's great news. It seems that this is built with fallbacks in place, which isn't always the case (without guards, if you try to run AVX2 code on a non-AVX2 machine, you'll normally crash loudly).

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

@ngam
Contributor

ngam commented Apr 29, 2022

Works perfectly fine!

That's great news - seems that this is built with fall-backs in place, which isn't always the case (without guards, if you try to run AVX2 code on a non-AVX2 machine, you'll normally crash loudly).

Yep, try pip install intel-tensorflow-avx512...

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

Same, just without the AVX512

2022-04-28 22:43:22.335803: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

mamba list | grep tensorflow
tensorflow                2.7.0           cuda112py310he87a039_0    conda-forge
tensorflow-base           2.7.0           cuda112py310h2bd284a_0    conda-forge
tensorflow-estimator      2.7.0           cuda112py310h922d117_0    conda-forge
lscpu | grep avx
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

@h-vetinari
Member Author

Out of curiosity, what does cpu_feature_guard.cc print on a non-AVX512 machine?

Same, just without the AVX512

That's kind of the point - it has some dynamic feature detection and presumably uses this behind the scenes to load the correct libs/symbols. 🙃

@ngam
Contributor

ngam commented Apr 29, 2022

How do you make sense of the above with this line in build.sh?

export CC_OPT_FLAGS="-march=nocona -mtune=haswell"
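
My rough understanding is that -march=nocona only constrains what the C++ compiler emits for TensorFlow's own code; oneDNN generates its kernels at runtime with a JIT (xbyak), so the AVX2/AVX-512 paths listed in the log don't have to exist in the shared library at all. A crude way to sanity-check that (the library path is just a guess for a Python 3.10 conda env):

# count AVX-512 (zmm-register) instructions statically present in the library;
# a low count suggests the AVX-512 kernels are JIT-generated at runtime instead
objdump -d $CONDA_PREFIX/lib/python3.10/site-packages/tensorflow/libtensorflow_framework.so.2 | grep -c zmm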

@ngam
Contributor

ngam commented Apr 29, 2022

Also, in the future should we follow NVIDIA's build? Notable differences are:

  • CC_OPT_FLAGS="-march=sandybridge -mtune=broadwell"
  • TF_ENABLE_XLA=1

export TF_NEED_CUDA=1
export TF_NEED_CUTENSOR=1
export TF_NEED_TENSORRT=1
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=$(echo "${CUDA_VERSION}" | cut -d . -f 1-2)
export TF_CUBLAS_VERSION=$(echo "${CUBLAS_VERSION}" | cut -d . -f 1)
export TF_CUDNN_VERSION=$(echo "${CUDNN_VERSION}" | cut -d . -f 1)
export TF_NCCL_VERSION=$(echo "${NCCL_VERSION}" | cut -d . -f 1)
export TF_TENSORRT_VERSION=$(echo "${TRT_VERSION}" | cut -d . -f 1)
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
if [ "${TARGETARCH}" = "amd64" ] ; then export CC_OPT_FLAGS="-march=sandybridge -mtune=broadwell" ; fi
if [ "${TARGETARCH}" = "arm64" ] ; then export CC_OPT_FLAGS="-march=armv8-a" ; fi
export TF_USE_CCACHE

# This is the flags that Google use internally with clang.
# Adding them help make the compilation error more readable even if g++ doesn't do exactly as clang.
# Google use -Werror too. But we can't use it as the error from g++ isn't exactly the same and will trigger useless compilation error.
export CC_OPT_FLAGS="$CC_OPT_FLAGS -Wno-address-of-packed-member -Wno-defaulted-function-deleted -Wno-enum-compare-switch -Wno-expansion-to-defined -Wno-ignored-attributes -Wno-ignored-qualifiers -Wno-inconsistent-missing-override -Wno-int-in-bool-context -Wno-misleading-indentation -Wno-potentially-evaluated-expression -Wno-psabi -Wno-range-loop-analysis -Wno-return-std-move -Wno-sizeof-pointer-div -Wno-sizeof-array-div -Wno-string-concatenation -Wno-tautological-constant-compare -Wno-tautological-type-limit-compare -Wno-tautological-undefined-compare -Wno-tautological-unsigned-zero-compare -Wno-tautological-unsigned-enum-zero-compare -Wno-undefined-func-template -Wno-unused-lambda-capture -Wno-unused-local-typedef -Wno-void-pointer-to-int-cast -Wno-uninitialized-const-reference -Wno-compound-token-split -Wno-ambiguous-member-template -Wno-char-subscripts -Wno-error=deprecated-declarations -Wno-extern-c-compat -Wno-gnu-alignof-expression -Wno-gnu-variable-sized-type-not-at-end -Wno-implicit-int-float-conversion -Wno-invalid-source-encoding -Wno-mismatched-tags -Wno-pointer-sign -Wno-private-header -Wno-sign-compare -Wno-signed-unsigned-wchar -Wno-strict-overflow -Wno-trigraphs -Wno-unknown-pragmas -Wno-unused-const-variable -Wno-unused-function -Wno-unused-private-field -Wno-user-defined-warnings -Wvla -Wno-reserved-user-defined-literal -Wno-return-type-c-linkage -Wno-self-assign-overloaded -Woverloaded-virtual -Wnon-virtual-dtor -Wno-deprecated -Wno-invalid-offsetof -Wimplicit-fallthrough -Wno-final-dtor-non-final-class -Wno-c++20-designator -Wno-register -Wno-dynamic-exception-spec"
