OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425

Merged
merged 23 commits into from
Feb 8, 2021
23 commits
1ce2427
OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tun…
hfp Feb 4, 2021
69a0e85
Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA …
hfp Feb 4, 2021
3bf854c
Disabled ACC_OPENCL_THREADLOCAL_CONTEXT since DBCSR calls init/finali…
hfp Feb 4, 2021
f14eee4
Updated LIBXSMM prior to v1.17.
hfp Feb 4, 2021
9598017
Attempt to runtime-test OpenCL BE/LIBSMM.
hfp Feb 4, 2021
6721baa
Reduced console output to potentially improve runtime of (CI-)tests.
hfp Feb 4, 2021
ee12e07
Increased timeout from 15m to 20m.
hfp Feb 4, 2021
8baf7ae
Fetch all commits before referring to some SHA.
hfp Feb 4, 2021
2b9335f
Revert "Attempt to runtime-test OpenCL BE/LIBSMM."
hfp Feb 4, 2021
20f9d25
Revised enabling ACC_OPENCL_THREADLOCAL_CONTEXT.
hfp Feb 5, 2021
a26e779
Repeated note about combining auto-tuned parameters for SP and DP in …
hfp Feb 5, 2021
cb91474
Only print device name if the device changed (and avoid duplicated ve…
hfp Feb 5, 2021
9221d1b
Removed tabs from source file (minor/unrelated change).
hfp Feb 5, 2021
8ac6f1d
More prefixes in follow-up of #419 (c_dbcsr_).
hfp Feb 5, 2021
b58a37b
Supply platform when forming context.
hfp Feb 5, 2021
6f3c910
Code cleanup.
hfp Feb 5, 2021
b5cb129
Try to avoid MPS issue (temporarily) testing with only one rank. Sync…
hfp Feb 5, 2021
367a117
Enabled OpenCL based runtime tests.
hfp Feb 5, 2021
6c9f84c
Fixed CI-scripts.
hfp Feb 5, 2021
9ff03d9
Fixed another variable which was left unbound (CI-script).
hfp Feb 5, 2021
d17f03f
Incorporated #428.
hfp Feb 8, 2021
1265b26
Merge branch 'develop' of https://github.com/cp2k/dbcsr into oclverbose
hfp Feb 8, 2021
2e57682
Warn about potentially exclusive device-mode.
hfp Feb 8, 2021
10 changes: 5 additions & 5 deletions .ci/daint.cscs.ch/Jenkinsfile
@@ -66,11 +66,11 @@ pipeline {
run_batch("0:15:00", "ocl", "build")
}
}
-// stage('test') {
-// steps {
-// run_batch("1:00:00", "ocl", "test")
-// }
-// }
+stage('test') {
+steps {
+run_batch("1:00:00", "ocl", "test")
+}
+}
}
}
stage("Intel") {
7 changes: 4 additions & 3 deletions .ci/daint.cscs.ch/ocl.build.sh
@@ -4,8 +4,8 @@
#SBATCH --constraint="mc"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
-#SBATCH --ntasks-per-node=4
-#SBATCH --cpus-per-task=3
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=12
#SBATCH --hint=nomultithread

set -o errexit
@@ -23,7 +23,8 @@ if [ ! -d "${HOME}/libxsmm" ]; then
git clone https://github.com/hfp/libxsmm.git
fi
cd "${HOME}/libxsmm"
-git checkout 02d6ab213a35d5fc2f6454c3b465598b0c086c17
+git fetch
+git checkout 05cab50ec6f11a86c15c0ed511c5a9066c613dfb
make -j
cd ..

7 changes: 3 additions & 4 deletions .ci/daint.cscs.ch/ocl.test.sh
@@ -4,8 +4,7 @@
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --nodes=1
-#SBATCH --ntasks-per-node=4
-#SBATCH --cpus-per-task=3
+#SBATCH --ntasks-per-node=1
#SBATCH --hint=nomultithread

set -o errexit
@@ -21,10 +20,10 @@ set -o xtrace # do not set earlier to avoid noise from module

umask 0002 # make sure group members can access the data

-mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}.ocl"
+mkdir -p "${SCRATCH}/${BUILD_TAG}.ocl"
+chmod 0775 "${SCRATCH}/${BUILD_TAG}.ocl"
cd "${SCRATCH}/${BUILD_TAG}.ocl"

-export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

16 changes: 11 additions & 5 deletions src/acc/acc_bench_smm.c
@@ -106,15 +106,18 @@ int main(int argc, char* argv[])
printf("%s%s%i %i %i %i %i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "",
nrepeat, stack_size, m, n, k, nc, na, nb);
CHECK(c_dbcsr_acc_init(), &result);
+/* note: libsmm_acc_init() may imply acc_init() */
+CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
-#if defined(_DEBUG)
-fprintf(stderr, "Error: no device found!\n");
+fprintf(stderr, "No ACC-device found!\n");
+#if !defined(__CUDA)
+CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -165,14 +168,14 @@ int main(int argc, char* argv[])
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), n, k, MAX_KERNEL_DIM, stream), &result);
}
-#if defined(USE_LIBXSMM)
+# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
start = libxsmm_timer_tick();
-#endif
+# endif
/* to perform NN-SMMs on the device, all B-matrices are transposed upfront (SMM-kernel is limited to NT) */
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), k, n, MAX_KERNEL_DIM, stream), &result);
-#if defined(USE_LIBXSMM)
+# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
transpose = libxsmm_timer_duration(start, libxsmm_timer_tick());
# endif
@@ -282,6 +285,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(bmat_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(cmat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
+#if !defined(__CUDA)
+CHECK(libsmm_acc_finalize(), NULL);
+#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
10 changes: 8 additions & 2 deletions src/acc/acc_bench_trans.c
@@ -91,15 +91,18 @@ int main(int argc, char* argv[])
assert(m <= (mn / n) && 0 == (mn % n));
printf("%s%s%i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "", nrepeat, stack_size, m, n);
CHECK(c_dbcsr_acc_init(), &result);
+/* note: libsmm_acc_init() may imply acc_init() */
+CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
-#if defined(_DEBUG)
-fprintf(stderr, "Error: no device found!\n");
+fprintf(stderr, "No ACC-device found!\n");
+#if !defined(__CUDA)
+CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -210,6 +213,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(stack_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(mat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
+#if !defined(__CUDA)
+CHECK(libsmm_acc_finalize(), NULL);
+#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
9 changes: 3 additions & 6 deletions src/acc/cuda/Makefile
@@ -1,6 +1,6 @@
INCACC := $(wildcard *.h*) ../acc.h
SRCACC := $(wildcard *.cpp)
-OBJACC := $(SRCACC:.cpp=.o) acc_cublas.o
+OBJACC := $(SRCACC:.cpp=.o)

GPUSMM := $(wildcard ../libsmm_acc/kernels/*.h*)
INCSMM := $(wildcard ../libsmm_acc/*.h*) ../acc_libsmm.h \
@@ -130,10 +130,7 @@ test: ../dbcsr_acc_test
../libsmm_acc/smm_acc_kernels.h: $(GPUSMM) Makefile ../libsmm_acc/generate_kernels.py ../libsmm_acc/parameters/parameters_$(WITH_GPU).json
@cd ../libsmm_acc && $(PYTHON) ../libsmm_acc/generate_kernels.py ../libsmm_acc/kernels

-acc_cublas.o: acc_cublas.cu Makefile
-$(NVCC) $(addprefix -Xcompiler $(NULL),$(CXXFLAGS)) -c $< -o $@
-
-../dbcsr_acc.a: $(OBJACC) acc_cublas.o ../libsmm_acc/libsmm_acc_init.o
+../dbcsr_acc.a: $(OBJACC) ../libsmm_acc/libsmm_acc_init.o
$(AR) -rs $@ $^

../dbcsr_acc_smm.a: $(OBJSMM)
@@ -153,7 +150,7 @@ acc_bench_trans.o: ../acc_bench_trans.c Makefile
$(CXX) $^ $(LDFLAGS) -o $@

dbcsr_acc_test.o: ../../../tests/dbcsr_acc_test.c Makefile
-$(CC) $(CFLAGS) -c $< -o $@
+$(CC) $(CFLAGS) -I../.. -c $< -o $@
../dbcsr_acc_test: dbcsr_acc_test.o ../dbcsr_acc_smm.a ../dbcsr_acc.a
$(CXX) $^ $(LDFLAGS) -o $@

8 changes: 4 additions & 4 deletions src/acc/cuda/acc_cublas.cpp
@@ -49,10 +49,10 @@ int acc_blas_dgemm(ACC_BLAS(Handle_t) *handle, char transa, char transb,
ACC_BLAS_CALL(SetStream, (*handle, *stream));

ACC_BLAS_CALL(Dgemm, (*handle, cTransa, cTransb,
-m, n, k,
-&alpha, &a_data[a_offset], lda,
-&b_data[ b_offset], ldb,
-&beta, &c_data[ c_offset], lda));
+m, n, k,
+&alpha, &a_data[a_offset], lda,
+&b_data[ b_offset], ldb,
+&beta, &c_data[ c_offset], lda));

return(0);
}
11 changes: 8 additions & 3 deletions src/acc/libsmm_acc/libsmm_acc_benchmark.cpp
@@ -350,9 +350,12 @@ int libsmm_acc_benchmark(libsmm_acc_benchmark_t* h,
best_gflops = gflops;
best_kernel = ikern;
}
-} else {
+}
+#if !defined(NDEBUG)
+else {
printf("%sOK %s\n", msg_prefix, descr);
}
+#endif
}

if(h->mode == tune){
@@ -427,10 +430,12 @@ int libsmm_acc_benchmark_transpose_(int n_stack, int* stack, int* d_stack,
if(sumGPU != sumCPU){
printf("%sERROR %s checksum_diff: %g\n", msg_prefix, descr, sumGPU-sumCPU);
error_counter++;
-} else {
+}
+#if !defined(NDEBUG)
+else {
printf("%sOK %s\n", msg_prefix, descr);
}
-
+#endif
return error_counter;

}
5 changes: 4 additions & 1 deletion src/acc/opencl/README.md
@@ -8,7 +8,7 @@ The OpenCL backend implements the [ACC interface](https://github.com/cp2k/dbcsr/

### Compile-time Settings

-Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line as per `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_VERBOSE` (which is disabled by default) can be enabled for debug purpose. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.
+Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line as per `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_DEBUG` (which is disabled by default) can be enabled for debug purpose. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.

An application of compile-time settings (and perhaps a valuable contribution) might be to call a GPU library in OpenCL-based LIBSMM. In such case, Shared Virtual Memory support (SVM) in OpenCL comes handy and can be enabled per `ACC_OPENCL_SVM`. The latter allows then to simply take the raw pointer out of an `cl_mem` object, and pass it into such library/function (which in turn can work across language borders, etc.).

@@ -19,6 +19,9 @@ Runtime settings are made by the means of environment variables (implemented in
* `ACC_OPENCL_VENDOR`: character string matching the vendor of the OpenCL device in an case-insensitive fashion, e.g., "intel".
* `ACC_OPENCL_DEVTYPE`: character string matching the device-kind like "cpu", "gpu", or another kind if neither CPU or GPU.
* `ACC_OPENCL_DEVICE`: non-negative integer number to select a device from the (internally enumerated) list of devices.
+* `ACC_OPENCL_VERBOSE`: verbosity level (integer).
+    * `ACC_OPENCL_VERBOSE=1`: outputs (stderr) the number of devices found and the name of the selected device.
+    * `ACC_OPENCL_VERBOSE=2`: outputs (stderr) the duration needed to generate a requested kernel.
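
The verbosity levels listed above could be handled roughly as in the following sketch. It is not taken from the backend: the helper names (`verbosity_from_string`, `verbosity_from_env`, `report_device`, `report_kernel`) and the message formats are hypothetical, and it assumes that levels are cumulative and that an unset or non-numeric value means "quiet".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of DBCSR): turn the value of an environment
 * variable into a verbosity level. NULL, non-numeric, and negative values
 * all yield 0 (quiet); this is an assumption of the sketch. */
static int verbosity_from_string(const char* value)
{
  const int level = (NULL != value ? atoi(value) : 0);
  return (0 < level ? level : 0);
}

/* Hypothetical helper: read ACC_OPENCL_VERBOSE from the environment. */
static int verbosity_from_env(void)
{
  return verbosity_from_string(getenv("ACC_OPENCL_VERBOSE"));
}

/* Level 1 and above: report the device count and the selected device (stderr). */
static void report_device(int verbosity, int ndevices, const char* name)
{
  assert(NULL != name);
  if (1 <= verbosity) {
    fprintf(stderr, "INFO: ndevices=%i device=\"%s\"\n", ndevices, name);
  }
}

/* Level 2 and above: report how long generating a kernel took (stderr). */
static void report_kernel(int verbosity, const char* kernel, double seconds)
{
  assert(NULL != kernel);
  if (2 <= verbosity) {
    fprintf(stderr, "INFO: kernel \"%s\" generated in %.1f ms\n", kernel, 1E3 * seconds);
  }
}
```

A caller would query `verbosity_from_env()` once during initialization and pass the level to the reporting functions; writing to stderr keeps regular benchmark output on stdout unaffected.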

The OpenCL backend enumerates and orders devices primarily by device-kind (GPU, CPU, and others in that order) and by memory capacity (secondary criterion). Device IDs are zero-based as per ACC interface (and less than what is permitted/returned by `acc_get_ndevices`).
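
The ordering described above can be illustrated with a small, self-contained sketch. The record type `device_info_t`, the enum values, and the function names are hypothetical and only stand in for what the backend would presumably query through OpenCL (e.g., `CL_DEVICE_TYPE` and `CL_DEVICE_GLOBAL_MEM_SIZE` via `clGetDeviceInfo`). The sketch also assumes that larger memory capacity sorts first, which the text does not state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device-kind ranking: GPU first, then CPU, then anything else. */
typedef enum { KIND_GPU = 0, KIND_CPU = 1, KIND_OTHER = 2 } device_kind_t;

/* Hypothetical device record used only for this illustration. */
typedef struct {
  device_kind_t kind;
  unsigned long long mem_bytes;
  const char* name;
} device_info_t;

/* qsort comparator: primary key is the device-kind, secondary key is the
 * memory capacity (larger first; the direction is an assumption). */
static int compare_devices(const void* a, const void* b)
{
  const device_info_t* const x = (const device_info_t*)a;
  const device_info_t* const y = (const device_info_t*)b;
  if (x->kind != y->kind) return (x->kind < y->kind ? -1 : 1);
  if (x->mem_bytes != y->mem_bytes) return (x->mem_bytes > y->mem_bytes ? -1 : 1);
  return 0;
}

/* Sort the list in place; the resulting index is the zero-based device ID. */
static void order_devices(device_info_t* devices, size_t ndevices)
{
  assert(NULL != devices || 0 == ndevices);
  if (0 < ndevices) {
    qsort(devices, ndevices, sizeof(device_info_t), compare_devices);
  }
}
```

With such an ordering, device ID 0 refers to the GPU with the most memory if any GPU is present, and a value given per `ACC_OPENCL_DEVICE` indexes into the sorted list.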
