Unbundling oneDNN #183
Here are the build files for the different versions (the versions manually set in the bazel scripts don't match the artefacts they're pulling in).
Maybe I am a bit late or uninformed about this, but my issue with the onednn stuff is that 1) it is kind of not worth it in my testing, and 2) it is actually really easy to enable it after the fact: you just need to declare an environment variable at the end of my env.yml file (for CPUs; I usually get tensorflow-gpu from conda for GPUs):

```yaml
  - onednn        # optimization for the Intel oneDNN library
  - pip
  - pip:
      - tensorflow       # pip install for 2.7.0 (cpu)
      - tensorflow-gan
variables:
  TF_ENABLE_ONEDNN_OPTS: 1   # enables the oneDNN optimizations
```

Does the build need to be modified for this env variable to work? I remember it working fine with the conda-forge tensorflow, though I usually get the CPU version from pip (as above). See the Intel info for reference here.
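If you don't want to bake the flag into the environment file, it can (I believe) also be set from the process environment, as long as that happens before TensorFlow is imported. A minimal sketch, assuming the variable is read when TensorFlow initializes:

```python
# Minimal sketch: enable the oneDNN optimizations via the environment variable.
# Assumption: the flag is read when TensorFlow initializes, so it must be set
# before the tensorflow import.
import os

os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import tensorflow as tf  # noqa: E402  (imported after setting the flag)

print(tf.__version__)
```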
Maybe a simple test can also be added to ensure this? I usually test by running this python file... which is from their "sanity check", but if you look closely, it seems like a phony test. Anyway, the Intel optimization for tensorflow is very underwhelming. I really don't think Intel is serious about this stuff, and their efforts seem to be haphazard and super low-energy. Both their devcloud and the whole oneapi stuff are a total mess. (You can browse their github repos to see what they are up to...)

```python
import tensorflow as tf
import os

def get_mkl_enabled_flag():
    mkl_enabled = False
    major_version = int(tf.__version__.split(".")[0])
    minor_version = int(tf.__version__.split(".")[1])
    if major_version >= 2:
        # The private helper moved between TF releases.
        if minor_version < 5:
            from tensorflow.python import _pywrap_util_port
        else:
            from tensorflow.python.util import _pywrap_util_port
        onednn_enabled = int(os.environ.get('TF_ENABLE_ONEDNN_OPTS', '0'))
        mkl_enabled = _pywrap_util_port.IsMklEnabled() or (onednn_enabled == 1)
    else:
        mkl_enabled = tf.pywrap_tensorflow.IsMklEnabled()
    return mkl_enabled

print("We are using Tensorflow version", tf.__version__)
print("MKL enabled :", get_mkl_enabled_flag())
print(tf.config.list_physical_devices())
print(tf.test.is_gpu_available())
print(tf.test.gpu_device_name())
```
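To make the comparison explicit, that check could be run twice, once with the flag off and once with it on, in separate processes so the setting is picked up at import time. A small driver sketch; the file name check_onednn.py is hypothetical, it's just the script above saved to disk:

```python
# Hypothetical driver: run the sanity-check script above (saved as
# check_onednn.py) with the oneDNN flag disabled and then enabled.
import os
import subprocess
import sys

for flag in ("0", "1"):
    env = dict(os.environ, TF_ENABLE_ONEDNN_OPTS=flag)
    print(f"--- TF_ENABLE_ONEDNN_OPTS={flag} ---")
    subprocess.run([sys.executable, "check_onednn.py"], env=env, check=False)
```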
@ngam This issue is not about using or not using the oneDNN ops, but about whether we build oneDNN as part of the tensorflow build or whether we can just link against the existing oneDNN package.
Oooh… thanks for explaining. Well, the latter makes much more sense (linking against the existing one), especially in light of what I said about the actual utility of it. Then you can test whether the linking works…
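One rough way to check whether an installed tensorflow is dynamically linked against an external oneDNN (rather than carrying a bundled copy) is to inspect its shared objects with ldd. A minimal Linux-only sketch; the glob over the package directory and the string match on "dnnl" are assumptions, not part of this feedstock, and no output would simply suggest oneDNN is bundled statically or absent:

```python
# Sketch: report any libdnnl / mkldnn dependencies of the installed
# tensorflow shared objects. Assumes Linux with `ldd` available; the
# shared-object layout of the tensorflow package varies by version.
import glob
import os
import subprocess

import tensorflow as tf

pkg_dir = os.path.dirname(tf.__file__)
for so in glob.glob(os.path.join(pkg_dir, "**", "*.so*"), recursive=True):
    out = subprocess.run(["ldd", so], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "dnnl" in line or "mkldnn" in line:
            print(os.path.basename(so), "->", line.strip())
```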
I would like to revisit my statement above. I can finally see the advantage of the oneDNN optimizations: they speed up my code by roughly 50% for smaller loads, and by up to 400% for more sustained loads. The AVX-512 targeting also seems to play a key role; that alone can result in a significant speedup. Interestingly, the tensorflow from conda-forge (2.7.0) prints out
but usually tensorflow from PyPI or Intel prints out
so, seemingly, we have better optimization here? This translates into a small performance improvement (comparing v2.7.0 from conda-forge against PyPI or Intel). My current case involves a lot of intermediate mapping (and very large files). Edit: the NGC containers from NVIDIA behave very similarly to conda-forge's tensorflow, including the small but noticeable speedup.
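A crude way to compare two builds like this is to time the same op under each one. In the sketch below the matrix shapes and repeat count are arbitrary assumptions, not a proper benchmark; run it once per environment (e.g. with TF_ENABLE_ONEDNN_OPTS set to 0 and then 1):

```python
# Rough timing sketch for comparing TensorFlow builds on CPU.
import time

import tensorflow as tf

x = tf.random.normal([2048, 2048])
y = tf.random.normal([2048, 2048])

# Warm-up so one-time op/graph setup is not measured.
_ = tf.matmul(x, y)

start = time.perf_counter()
for _ in range(20):
    _ = tf.matmul(x, y).numpy()  # .numpy() forces the result to materialize
elapsed = time.perf_counter() - start

print(f"TensorFlow {tf.__version__}: {elapsed / 20:.4f} s per 2048x2048 matmul")
```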
Do we know how this is handled for older CPUs (e.g. without AVX-512): does it degrade gracefully, or do things crash? Before we have something like conda-forge/conda-forge.github.io#1261, we shouldn't be compiling things that hard-require more than SSE4 (I think...). Or perhaps AVX can be the new baseline? As an example, faiss (both upstream and in conda-forge) builds fat binaries that compile against both SSE4 and AVX2 and load the latter at runtime if possible, but that's not a scalable approach IMO. See conda-forge/faiss-split-feedstock#23
Works perfectly fine! Actually, not only does it work perfectly fine, for some reason it is still a bit faster (even on machines without AVX-512) than the Intel and PyPI tensorflows.
I don't know if this is mainly down to my esoteric data pipeline (shuffling ~100s of GB of data, so CPU performance actually matters in my case even when running on GPUs). I did a few more tests to see if it is consistent across different machines (CPU-only, GPU-CPU, different architectures) and it held up.
That's great news: it seems that this is built with fall-backs in place, which isn't always the case (without guards, if you try to run AVX2 code on a non-AVX2 machine, you'll normally crash loudly). Out of curiosity, what does it print there?
yep try
Same, just without the AVX512
That's kind of the point: it has some dynamic feature detection and presumably uses this behind the scenes to load the correct libs/symbols. 🙃
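For reference, a quick way to see which of these instruction sets the host actually exposes is to look at the CPU flags. A minimal, Linux-only sketch (reading /proc/cpuinfo is an assumption about the platform):

```python
# Minimal, Linux-only sketch: report which SIMD extensions the host CPU
# advertises, by reading the flags line from /proc/cpuinfo.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("sse4_2", "avx", "avx2", "avx512f"):
    print(f"{feature:8s} {'yes' if feature in flags else 'no'}")
```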
How do you make sense of the above with this line in build.sh? (tensorflow-feedstock/recipe/build.sh, line 134 at c43f282)
Also, should we follow NVIDIA's build in the future? Notable differences are:

```bash
export TF_NEED_CUDA=1
export TF_NEED_CUTENSOR=1
export TF_NEED_TENSORRT=1
export TF_CUDA_PATHS=/usr,/usr/local/cuda
export TF_CUDA_VERSION=$(echo "${CUDA_VERSION}" | cut -d . -f 1-2)
export TF_CUBLAS_VERSION=$(echo "${CUBLAS_VERSION}" | cut -d . -f 1)
export TF_CUDNN_VERSION=$(echo "${CUDNN_VERSION}" | cut -d . -f 1)
export TF_NCCL_VERSION=$(echo "${NCCL_VERSION}" | cut -d . -f 1)
export TF_TENSORRT_VERSION=$(echo "${TRT_VERSION}" | cut -d . -f 1)
export TF_ENABLE_XLA=1
export TF_NEED_HDFS=0
if [ "${TARGETARCH}" = "amd64" ] ; then export CC_OPT_FLAGS="-march=sandybridge -mtune=broadwell" ; fi
if [ "${TARGETARCH}" = "arm64" ] ; then export CC_OPT_FLAGS="-march=armv8-a" ; fi
export TF_USE_CCACHE
# These are the flags that Google uses internally with clang.
# Adding them helps make compilation errors more readable, even though g++ doesn't behave exactly like clang.
# Google also uses -Werror, but we can't, since g++'s errors aren't exactly the same and would trigger spurious failures.
export CC_OPT_FLAGS="$CC_OPT_FLAGS -Wno-address-of-packed-member -Wno-defaulted-function-deleted -Wno-enum-compare-switch -Wno-expansion-to-defined -Wno-ignored-attributes -Wno-ignored-qualifiers -Wno-inconsistent-missing-override -Wno-int-in-bool-context -Wno-misleading-indentation -Wno-potentially-evaluated-expression -Wno-psabi -Wno-range-loop-analysis -Wno-return-std-move -Wno-sizeof-pointer-div -Wno-sizeof-array-div -Wno-string-concatenation -Wno-tautological-constant-compare -Wno-tautological-type-limit-compare -Wno-tautological-undefined-compare -Wno-tautological-unsigned-zero-compare -Wno-tautological-unsigned-enum-zero-compare -Wno-undefined-func-template -Wno-unused-lambda-capture -Wno-unused-local-typedef -Wno-void-pointer-to-int-cast -Wno-uninitialized-const-reference -Wno-compound-token-split -Wno-ambiguous-member-template -Wno-char-subscripts -Wno-error=deprecated-declarations -Wno-extern-c-compat -Wno-gnu-alignof-expression -Wno-gnu-variable-sized-type-not-at-end -Wno-implicit-int-float-conversion -Wno-invalid-source-encoding -Wno-mismatched-tags -Wno-pointer-sign -Wno-private-header -Wno-sign-compare -Wno-signed-unsigned-wchar -Wno-strict-overflow -Wno-trigraphs -Wno-unknown-pragmas -Wno-unused-const-variable -Wno-unused-function -Wno-unused-private-field -Wno-user-defined-warnings -Wvla -Wno-reserved-user-defined-literal -Wno-return-type-c-linkage -Wno-self-assign-overloaded -Woverloaded-virtual -Wnon-virtual-dtor -Wno-deprecated -Wno-invalid-offsetof -Wimplicit-fallthrough -Wno-final-dtor-non-final-class -Wno-c++20-designator -Wno-register -Wno-dynamic-exception-spec"
```
Breaking out the discussion from #175:
I looked at how upstream builds this, and - annoyingly - it's a bit of a mess. There seem to be three different versions of onednn being pulled in, with one of them (0.21.3) being quite old and not available in conda-forge.
I'm happy to start a PR that tries to rip things out and see how far we get, but I just wanted to get an opinion (@xhochy?) on whether replacing the upstream version is even realistic given the different versions. Or maybe we just remove the most important one?