cmake \
-Dalpaka_ACC_ANY_BT_OMP5_ENABLE=on \
-DBUILD_TESTING=on \
-Dalpaka_BUILD_EXAMPLES=on \
Also set `-DCMAKE_CXX_FLAGS` in accordance with the compiler and target. Information about this is maintained on readthedocs.
Shared memory is implemented using a small object allocator in `BlockSharedMemStOmp5`, which draws from a fixed-size buffer allocated by `BlockSharedMemDynMember`. The two components are therefore coupled.
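To illustrate the coupling between the allocator and the buffer, here is a minimal sketch of a bump allocator that hands out small objects from a fixed-size buffer. It is not alpaka's implementation: `FixedBuffer` and `BumpAllocator` are hypothetical names, and the sketch assumes trivially destructible types whose alignment does not exceed that of `std::max_align_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for the component that owns the fixed-size buffer.
template<std::size_t TBytes>
struct FixedBuffer
{
    alignas(std::max_align_t) unsigned char data[TBytes];
};

// Hypothetical small object allocator that carves objects out of the buffer.
// It never frees: everything lives until the buffer itself goes away.
template<std::size_t TBytes>
class BumpAllocator
{
public:
    explicit BumpAllocator(FixedBuffer<TBytes>& buf) : m_buf(buf)
    {
    }

    // Returns a value-initialized T inside the buffer, or nullptr if it is full.
    template<typename T>
    T* alloc()
    {
        // Round the current offset up to the alignment of T.
        std::size_t const aligned = (m_offset + alignof(T) - 1) / alignof(T) * alignof(T);
        if(aligned + sizeof(T) > TBytes)
            return nullptr;
        m_offset = aligned + sizeof(T);
        return new(m_buf.data + aligned) T{};
    }

    // Number of bytes consumed so far, including alignment padding.
    std::size_t used() const
    {
        return m_offset;
    }

private:
    FixedBuffer<TBytes>& m_buf;
    std::size_t m_offset = 0;
};
```

Because the allocator only ever advances an offset into a buffer it does not own, the capacity available for static shared variables is bounded by whatever size the buffer owner chose.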
OpenMP 5 offers the directive `omp allocate allocator(omp_pteam_mem_alloc)` (used by `BlockSharedMemStOmp5BuiltIn`), which can in theory be used for static shared memory variables. There is no useful built-in support for dynamic block-shared memory to go with that. Usage of the built-in can be configured using the `alpaka_OFFLOAD_USE_BUILTIN_SHARED_MEM` flag:
- `alpaka_OFFLOAD_USE_BUILTIN_SHARED_MEM=OFF`: Do not use `omp allocate` (default, only available behavior with OpenMP < 5).
- `alpaka_OFFLOAD_USE_BUILTIN_SHARED_MEM=DYN_FIXED`: Use `omp allocate`, use a fixed-size team-shared array for dynamic shared memory (the fixed size is `alpaka_BLOCK_SHARED_DYN_MEMBER_KIB`).
- `alpaka_OFFLOAD_USE_BUILTIN_SHARED_MEM=DYN_ALLOC`: Use `omp allocate`, use an `omp_alloc()` API call in the target region to allocate dynamic shared memory. The standard appears to allow this, but it is not useful for the following reasons:
  - In the best case, this would lead to an on-device `malloc` on GPU, which has bad performance and does not use on-chip memory.
  - At least in clang GPU targets (nvptx64, hsa), the symbols `omp_alloc` and `omp_free` are undefined (linker error, code compiles).
`blockSharedSharingTest` tests that shared memory is shared correctly, that is, shared within a block and not more widely:
| compiler | target | OFF | DYN_FIXED | DYN_ALLOC |
|---|---|---|---|---|
| clang 14 (1.) | x86 | ✅ | ✅ (2.) | ✅ (2.) |
| clang 14 (1.) | nvptx | ✅ | ✅ | E (3.) |
| clang 14 (1.) | hsa | C | C | E (3.) |
| gcc 11 | x86 | ✅ | N | N |
| gcc 11 | nvptx | ✅/❌ (4.) | N | N |
| nvhpc 22.1 | x86 | ✅ | N (5.) | N (5.) |
| nvhpc 22.3 | nvptx | ✅ | N (5.) | N (5.) |
Keys:
- ✅: Test Passes.
- ❌l: Test fails, shared memory is not shared.
- ❌g: Test fails, shared memory is global/shared too widely.
- ❌: Test fails for other reason.
- C: Test compiles, not run.
- E: Test does not build.
- N: Not supported.
Footnotes:
1. git main `95a436f8cca6991dc0f30588d9b1af3223818168`
2. `omp allocate` does not actually work; the variable being `static` makes it work, which in itself is non-conforming behavior.
3. Linker error: no symbols `omp_alloc`, `omp_free` for target code.
4. Apparently gcc's OpenMP runtime will not run more than 8 threads per block on GPU: pass for `blockThreadCount <= 8`, fail for more.
5. NVHPC 22.1 claims to support OpenMP 5.1 (`_OPENMP = 202011`).
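The `_OPENMP` macro encodes the release date (yyyymm) of the OpenMP specification a compiler claims to implement, which is how the 5.1 claim above can be read off. A minimal sketch for checking this, with the hypothetical helper names `openmpVersionName` and `compiledOpenmpVersion`:

```cpp
#include <cassert>
#include <string>

// Map an _OPENMP date value (yyyymm) to the specification version it denotes.
inline std::string openmpVersionName(long date)
{
    switch(date)
    {
    case 201307:
        return "4.0";
    case 201511:
        return "4.5";
    case 201811:
        return "5.0";
    case 202011:
        return "5.1";
    case 202111:
        return "5.2";
    default:
        return "unknown";
    }
}

// Version claimed by the compiler that builds this translation unit,
// or "none" when OpenMP is not enabled.
inline std::string compiledOpenmpVersion()
{
#ifdef _OPENMP
    return openmpVersionName(_OPENMP);
#else
    return "none";
#endif
}
```

As the table shows, a claimed version says nothing about which individual features work on a given target.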
- No separable compilation. OpenMP 5 requires functions that are called, without being inlined, from a target region to be marked with pragmas so that device code is generated for them. This cannot be wrapped by macros like `ALPAKA_FN_DEVICE`, because they appear between the template parameter list and the function name, and because OpenMP requires a pair of directives enclosing the function, which would take two macros. alpaka-group#1126 (comment)
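The following sketch shows why a single annotation macro cannot carry the OpenMP marking. `MY_FN_DEVICE` is a hypothetical stand-in for a macro in the style of `ALPAKA_FN_DEVICE`, and `twice` and `runOnTarget` are illustrative names. Without OpenMP enabled, the pragmas are ignored and the code runs on the host.

```cpp
#include <cassert>

// Hypothetical single-token annotation. At this position it can expand to a
// specifier or attribute, but it cannot open and close a pragma region
// around the whole function.
#define MY_FN_DEVICE inline

// OpenMP needs one directive before and one after the function.
#pragma omp declare target
template<typename T>
MY_FN_DEVICE T twice(T x)
{
    return x + x;
}
#pragma omp end declare target

// Calls the marked function from a target region. Falls back to the host
// when no offload device is available.
inline int runOnTarget(int in)
{
    int out = 0;
#pragma omp target map(to : in) map(from : out)
    {
        out = twice(in);
    }
    return out;
}
```

For example, `runOnTarget(21)` returns `42` whether the region is offloaded or executed on the host.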
make vectorAdd
./example/vectorAdd/vectorAdd
If the run is successful, the last output line will be `Execution results correct!`. Otherwise, it will print the items where the result from the offload code disagrees with the expected result and print `Execution results incorrect!` at the end.
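The reporting behavior described above can be sketched as a host-side comparison. This is not the example's actual code: `reportResults` is a hypothetical helper and the per-item message format is an assumption. Only the two final status lines are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical host-side check: print every mismatching item, then the
// final status line. Returns true if all items agree.
inline bool reportResults(std::vector<int> const& result, std::vector<int> const& expected)
{
    bool correct = result.size() == expected.size();
    for(std::size_t i = 0; i < result.size() && i < expected.size(); ++i)
    {
        if(result[i] != expected[i])
        {
            // Assumed message format, for illustration only.
            std::cout << "item " << i << ": got " << result[i] << ", expected " << expected[i] << '\n';
            correct = false;
        }
    }
    std::cout << (correct ? "Execution results correct!" : "Execution results incorrect!") << '\n';
    return correct;
}
```

Printing the status last means the final output line alone is enough to tell whether a run passed.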
| build target | compiler | compile status | target | run status |
|---|---|---|---|---|
| vectorAdd | | | | |
| | GCC 10 | ok | host | ok |
| | GCC 10 | ptxas error (2) | nvptx | -- |
| | AOMP 0.7-4 | ok | x86 | `omp_target_alloc()` returns 0 |
| | AOMP 0.7-4 | linker: multiple def. of `gpuHeap` (1) | amdhsa | -- |
| | AOMP 0.7-5 | ok | x86 | ok |
| | AOMP 0.7-5 | ok | amdhsa | ok |
| | LLVM 10 | ok | x86 | ok |
| | XL 16.1.1-5 (Summit) | ok | nvptx | ok (num_threads workaround) (3) |
| | XL 16.1.1-5 (Summit) | ok | ppc64le | sigsegv (device memory allocated on GPU) |
1. error: Linking globals named 'gpuHeap': symbol multiply defined!

       /usr/bin/ld: cannot find a.out-openmp-amdgcn-amd-amdhsa-gfx900
       /usr/bin/ld: cannot find a.out-openmp-amdgcn-amd-amdhsa-gfx900
       clang-9: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
       clang-9: error: linker command failed with exit code 1 (use -v to see invocation)

2. ptxas:

       Linking CXX executable vectorAdd
       ptxas /tmp/cccKHuiQ.o, line 216; error : Label expected for argument 0 of instruction 'call'
       ptxas /tmp/cccKHuiQ.o, line 216; error : Function '_ZN6alpaka3ctx12CtxBlockOaccISt17integral_constantImLm1EEmEC1ERKNS_3vec3VecIS3_mEES9_S9_RKmSB_' not declared in this scope
       ptxas /tmp/cccKHuiQ.o, line 216; fatal : Call target not recognized
       ptxas fatal : Ptx assembly aborted due to errors
       nvptx-as: ptxas returned 255 exit status

3. IBM XL: When `num_threads` is set to any value, either in `#pragma omp parallel` or via `omp_set_num_threads`, the runtime only executes one thread per team. The workaround is not to set it with XL, in which case `$OMP_NUM_THREADS` threads run per team. Minimal example: https://github.com/jkelling/omp5tests/blob/master/parallel/parallel.cpp
Run `make` and, upon success, `ctest`.
| test | compiler | compile status | target | run status |
|---|---|---|---|---|
| ALL | | | | |
| | LLVM 10 | ok | x86 | pass |
| | LLVM 11 | ok | x86 | pass |
| | AOMP 0.7-5 | linker error with static lib (7) | x86 | -- |
| | AOMP 0.7-5 | linker error with static lib (8) | amdhsa | -- |
| | GCC 10 | mixed (1) | host | target alloc fail (2) |
| | GCC 11 | ok | host | target alloc fail (2) |
| | XL 16.1.1-5 (Summit) | no-halt (6) | nvptx | -- |
| | XL 16.1.1-5 (Summit) | no-halt (6) | ppc64le | -- |
1. Targets with multiple compilation units fail to link. alpaka-group#1126 (comment)
2. `omp_target_alloc()` allocates memory on the GPU while the code runs on the host and tries to access it there => segfault
3. _
4. _
5. _
6. XL does not appear to terminate when compiling targets like `blockShared`, in which tests are executed through the fixture in `~alpaka/test/common/include/alpaka/test/KernelExecutionFixture.hpp`. Removing the call at `alpaka/test/unit/block/shared/src/BlockSharedMemDyn.cpp:92-94` yields finite compilation time for `BlockSharedMemDyn.cpp.o`. XL is extremely slow at compiling code that uses the test framework catch2 used in Alpaka.
7. aomp 0.7-5 x86:

       /usr/bin/ld: cannot find libcommon-openmp-x86_64-pc-linux-gnu-sm_20.o: No such file or directory
       /usr/bin/ld: cannot find libcommon-host-x86_64-unknown-linux-gnu.o: No such file or directory
       clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
       clang-9: error: linker command failed with exit code 1 (use -v to see invocation)
       test/integ/matMul/CMakeFiles/matMul.dir/build.make:85: recipe for target 'test/integ/matMul/matMul' failed

8. aomp 0.7-5 HSA:

       /home/kelling/rocm/aomp_0.7-5/bin/clang-build-select-link: libcommon-openmp-amdgcn-amd-amdhsa-gfx900.o:1:2: error: expected integer
       !<arch>
        ^
       /home/kelling/rocm/aomp_0.7-5/bin/clang-build-select-link: error: loading file 'libcommon-openmp-amdgcn-amd-amdhsa-gfx900.o'
       /usr/bin/ld: cannot find a.out-openmp-amdgcn-amd-amdhsa-gfx900
       /usr/bin/ld: cannot find a.out-openmp-amdgcn-amd-amdhsa-gfx900
       clang-9: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
       clang-9: error: linker command failed with exit code 1 (use -v to see invocation)