Regarding CUDA-awareness #6197

AtlantaPepsi · 2022-09-27T18:43:15Z

AtlantaPepsi
Sep 27, 2022

I have some doubts about what happens under the hood of CUDA-aware operations in MPICH, I'll use following snippet as an example:

for (int i = 0; i < 3; i++) {  \\every process has 3 neighbor
  MPI_Irecv(rbuf, n*n, MPI_DOUBLE, ...); \\rbuf is device pointer
  kernel<<< ... >>>(sbuf,d_data); \\each process pack some local data into send buffer, all memory address are allocated on device
  cudaDeviceSynchronize(); \\if Isend is not stream triggered
  MPI_Isend(sbuf, n*n, MPI_DOUBLE, ...);
}

I inspected both regular Isend/recv as well as stream enqueued MPIX_Isend/recv. I want to confirm first about with stream enqueued case, does MPIX_send_enqueue between device addresses always do a host side malloc then copy data back to pinned host memory?
It seems so from profiler output (purple part is the DtoH memcpy, and cudaMallocHost are responsible for majority of the execution time), so I found this following part. I think MPIR_GPU_query_pointer_is_dev asserts whether pointer is device address, right?
I see similar but different sequence of events for regular MPI_Isend/recv. The very first Send or Receive will malloc host side memory and do a cudaMemcpyAsync, then subsequent operations are very fast and without copy back (following the first long MPI_Isend the next two are very short).

I haven't found the corresponding source code yet, could you explain what's going on here @hzhou ? Why does first MPI_Isend allocate host memory, and what data is it sending back to host from device? If it's the actual data to be transferred to host, why do subsequent calls not involve copy back?

hzhou · 2022-09-27T19:00:06Z

hzhou
Sep 27, 2022
Maintainer

I inspected both regular Isend/recv as well as stream enqueued MPIX_Isend/recv. I want to confirm first about with stream enqueued case, does MPIX_send_enqueue between device addresses always do a host side malloc then copy data back to pinned host memory?

You observed correctly. In the current code, we are just focusing on getting the semantics right as a first stage. The current implementation is merely a basic or fallback implementation, and bad performance is expected. We are looking at optimizations, which will depend on the synchronization methods that are available in the GPU runtime and the triggering capability from the NIC. Expect the performance in our future releases.

12 replies

hzhou Sep 27, 2022
Maintainer

If it is intra-node device to device, it should go to the ipc send path. Just to clarify, which code are you looking at?

AtlantaPepsi Sep 27, 2022
Author

I was referring to my own code snippet I wrote in the question, it's basically the exact same lines except with full parameters.

The highlighted part in the profiler output is a MemcpyAsync, Device to Host. I thought it was a part of Isend. If it's as you said, sender send host-side handle to receiver, and data directly send device to device, there shouldn't be a host memory allocation and copy from device to host, right?

AtlantaPepsi May 12, 2023
Author

Hi @hzhou , I just wanted to check back on progress with stream-triggered implementations. I see you made some changes since later last year, but have you had more thoughts on this:

The current implementation is merely a basic or fallback implementation, and bad performance is expected. We are looking at optimizations, which will depend on the synchronization methods that are available in the GPU runtime and the triggering capability from the NIC.

So far two things stand out in MPICH implementation of GPU stream enqueued APIs: that it forces data back to host for communications, and it relies on host callback as trigger mechanism.

For the first one, I wonder what is preventing a CUDA-Aware implementation of stream-triggered calls. Is it just simple programming effort of adding some "if", or is there some actual hazard with calling cudaLaunchHostFunc on an CUDA-Aware MPI call? I tried Jim's MPI-ACX and it seemed to have some weird deadlock for stream-triggered Send/Recv directly on device data. I'm not sure if something along that line is what you had in mind, when you mentioned additional "synchronization methods that are available in the GPU runtime"?

Secondly, for the trigger mechanism, seems MPICH still relies on cuda callback, while Cray MPI and MPI-ACX adapt a host side monitor approach. At best, such implementations of ST MPI are no worse than regular MPI calls in terms of performance. I know little about communication backends, but I'm very curious how exactly would a "triggering capability" directly from the NIC be better in terms of performance, and how do you envision that to be implemented?

I've done some study and testing with MPIX Stream, I'm fascinated and would love to make contribution for further improvement if you could shed some lights on above issues.

hzhou May 12, 2023
Maintainer

We have implemented host-side background thread and direct kernel synchronization via atomic mem-ops - #6062. And indeed, we are seeing occasional deadlocks. The reason is the mem-ops synchronization is invisible to CUDA runtime and it cannot reliably prevent deadlocks. To make progress, we need research efforts to measure and compare the performance as well as identify robust mechanisms to prevent deadlocks. Collaborations welcome!

AtlantaPepsi May 12, 2023
Author

We have implemented host-side background thread and direct kernel synchronization via atomic mem-ops - #6062

🤔the commit date was way before the opening of this discussion, I think I completed missed this part in MPICH. I directly dove into MPIR implementations mpich/src/mpi/stream/ and skipped MPID layer...

I'll read the code more carefully, but I assume the design principle is still: the host-side thread buffers incoming MPI call, monitor the progress on GPU streams, then fire the "regular", high-level MPI calls.

AtlantaPepsi · 2022-09-27T20:41:25Z

AtlantaPepsi
Sep 27, 2022
Author

With intra-node device to device, the sender will pass the memory handle to receiver, the receiver will map the haddle to their process address space, then do a e.g. cudaMemcpy.

@hzhou I guess it's difficult to lock down the exact lines, but the receiver operation you described also under src/mpid/ch4/shm/ipc/src? I still don't understand what specifically is being copied by cudaMemcpy your description, like the handle object?

0 replies

hzhou · 2022-09-27T22:08:28Z

hzhou
Sep 27, 2022
Maintainer

The actual ipc code should be in src/mpid/ch4/shm/ipc/src/ipc_p2p.h. The cudaMemcpy should be inside MPIR_Typerep_unpack. Is your datatype noncontiguous?

2 replies

AtlantaPepsi Sep 27, 2022
Author

I'm not using non-contiguous datatypes. The whole point of my actual kernel is to manually move strided data to contiguous buffer.

AtlantaPepsi Sep 27, 2022
Author

i'll do some even simpler testing just for regular device to device MPI_Isend, just double checking if such host side behavior persists...

AtlantaPepsi · 2022-10-04T22:08:31Z

AtlantaPepsi
Oct 4, 2022
Author

@hzhou a small update on this:
the above behavior is due to mixture of internode and intranode communication. In my code there's 2x2x2 processes, each process talking to three neighbor on x,y,z axis. on Perlmutter 4 devices reside on same node connected via NVLink, the Isend between them are done directly through DtoD cudaMemcpy.

0 replies

AtlantaPepsi · 2022-10-04T22:19:00Z

AtlantaPepsi
Oct 4, 2022
Author

so the real question is why does MPICH do copy back to host for internode communication.

I have double checked with different job partition that internode communication usually involves malloc to host and cudaMemcpy DtoH.
I also confirmed that GPUs on Perlmutter are connected to NIC directly on the RDMA network

Any additional suggestions on why this could happen? Here i'll also attach the build script, we were using the tar file of 4.1a1 on website, and environments are COMPILER= nvidia, ARCH=x86-milan_nvidia80, NET=ofi:

#!/bin/bash
set -ex
#------- information
# Supports a matrix build of MPICH for 
# compiler = [llvm nvidia gnu cray]
# arch = [x86-milan, x86-milan_nvidia80]
# net = [ofi]
#------- common functions and version information
pkgdir=$(dirname ${BASH_SOURCE})
source ${pkgdir}/../common.sh
source ${pkgdir}/version.sh
#------- setup the build
pkg=$(realpath $(dirname ${BASH_SOURCE}))
build_dir=$(mktemp -d --tmpdir=/tmp ${name}.XXXXXXX)
cd ${build_dir}
threads=${threads:=32}
cache ${tarfile} ${url}
tar --extract --file ${NPE_CACHE}/${tarfile}
cd $(basename ${tarfile} .tar.gz)
#------- configure
PMI=/opt/cray/pe/pmi/default
XPMEM=/opt/cray/xpmem/default
FABRIC=/opt/cray/libfabric/1.11.0.4.75

opts=""
opts+=" --prefix=${prefix}"
opts+=" --with-pm=no"
opts+=" --with-pmi=cray"
opts+=" --with-xpmem=${XPMEM}"
opts+=" --with-wrapper-dl-type=rpath"
opts+=" --enable-threads=multiple"
opts+=" --enable-shared=yes"
opts+=" --enable-static=no"

CPPFLAGS="-I${PMI}/include"
LDFLAGS="-L${PMI}/lib -L${XPMEM}/lib64"
LIBS="-lpmi -lpmi2 -lxpmem"
CFLAGS=""

# net
case ${NET} in
    ofi/1.0)
	set +x
	module load libfabric
	set -x
	FABRIC_INCLUDE=$(pkg-config --variable=includedir libfabric)
	FABRIC_LIB_DIR=$(pkg-config --variable=libdir libfabric)
	FABRIC_LIB=$(pkg-config --libs-only-l libfabric)
	opts+=" --with-libfabric-include=${FABRIC_INCLUDE} --with-libfabric-lib=${FABRIC_LIB_DIR}"
	opts+=" --with-device=ch4:ofi:verbs;ofi_rxm"
	LDFLAGS+=" -L"${FABRIC_LIB_DIR}
	LIBS+=" "${FABRIC_LIB}
	;;
    *)
	echo 'unsupported network '${NET}
	exit 1
	;;    
esac
# arch
case ${ARCH} in
    x86-milan)
	set +x
	module load craype-x86-milan
	set -x
	;;
    x86-milan_nvidia80)
	set +x
	module load craype-x86-milan
	module load cudatoolkit
	module load craype-accel-nvidia80
	set -x
	opts+=" --with-cuda=${CUDA_HOME}"
	LDFLAGS+=" -L${CUDA_HOME}/lib64"
	LIBS+=" -Wl,--as-needed,-lcudart,--no-as-needed -lcuda"
	;;
    *)
	echo 'unsupported arch '${ARCH}
	exit 1
	;;
esac
# compiler
case ${COMPILER} in
    llvm/14)
	set +x
	module use ${NPE_PREFIX}/modulefiles
	module load npe/${NPE_REV}
	module load llvm
	set -x
	ml
	# only needed *_nvidia80 for llvm/nightly, but doesn't hurt to do it anyway
	patch modules/yaksa/configure < ${pkg}/nvcc-allow-unsupported-compiler.patch
	# flang isn't ready yet
	opts+=" --enable-fortran=no"
	CC=clang
	CXX=clang++
	;;
    gnu/8.0)
	set +x
	module load PrgEnv-gnu
	set -x
	CC=gcc
	CXX=g++
	FC=gfortran
	FCFLAGS=-fallow-argument-mismatch
	;;
    nvidia/20)
	set +x
	module load PrgEnv-nvidia
	set -x
	CC=pgcc
	CXX=pgc++
	FC=pgfortran
	;;
    *)
	echo 'unsupported compiler '${COMPILER}
	exit 1
	;;
esac
#
#which ${FC}
#exit 1
set +x
# unload everything
module unload perftools-base xalt cray-dsmml cray-libsci cray-mpich
module load cray-pmi cray-pmi-lib
module list
set -x
#
#------- configure
./configure ${opts} \
	    CPPFLAGS=${CPPFLAGS} \
	    CC=${CC} \
	    CFLAGS=${CFLAGS} \
	    CXX=${CXX} \
	    FC=${FC} \
	    FCFLAGS=${FCFLAGS} \
	    F77=${FC} \
	    FFLAGS=${FCFLAGS} \
	    LIBS="${LIBS}" \
	    LDFLAGS="${LDFLAGS}" \
	    MPICHLIB_CFLAGS="-fPIC" \
	    MPICHLIB_CXXFLAGS="-fPIC" \
	    MPICHLIB_FFLAGS="-fPIC" \
	    MPICHLIB_FCFLAGS="-fPIC"


#------- compile
make -j ${threads}
#------- install
make install
#------- cleanup
cleanup

common.sh:

set -ex
#------- PE options
NPE_PREFIX=${NPE_PREFIX:=${HOME}/opt}
NPE_REV=${NPE_REV:=$(cat /etc/nersc_modules_rev)}
NPE_CACHE=${NPE_CACHE:=$SCRATCH/.nersc_pe_tar_cache}
NPE_MODULEFILES=${NPE_PREFIX}/${NPE_REV}/modulefiles
#------- helper functions
cleanup() {
    if [[ -z ${NPE_CLEANUP_OFF} ]]; then
	rm -rf $build_dir
    fi
}
trap cleanup ERR

cache() {
    local filename="$1"
    local url="$2"
    if ! [ -e ${NPE_CACHE}/${filename} ]; then
	mkdir -p ${NPE_CACHE}
	wget ${url} -O ${NPE_CACHE}/${filename}
    fi
    echo ${NPE_CACHE}/${filename}
}

and version.sh:

name=mpich
version=4.1a1
tarfile=${name}-${version}.tar.gz
url=[https://www.mpich.org/static/downloads/${version}/${tarfile}](https://www.mpich.org/static/downloads/$%7Bversion%7D/$%7Btarfile%7D)
prefix=${NPE_PREFIX}/${NPE_REV}/${name}/${version}/${NET}/${COMPILER}/${ARCH}

1 reply

hzhou Oct 4, 2022
Maintainer

so the real question is why does MPICH do copy back to host for internode communication.

We rely on libfabric provider for RDMA GPU direct, in the case of Perlmutter, the cxi provider. I suspect somehow the FI_HMEM capability is not turned on even though the cxi provider supports it. Try the patches in this PR #6098 to force enable the RDMA path.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regarding CUDA-awareness #6197

{{title}}

Replies: 5 comments 15 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Regarding CUDA-awareness #6197

AtlantaPepsi Sep 27, 2022

Replies: 5 comments · 15 replies

hzhou Sep 27, 2022 Maintainer

hzhou Sep 27, 2022 Maintainer

AtlantaPepsi Sep 27, 2022 Author

AtlantaPepsi May 12, 2023 Author

hzhou May 12, 2023 Maintainer

AtlantaPepsi May 12, 2023 Author

AtlantaPepsi Sep 27, 2022 Author

hzhou Sep 27, 2022 Maintainer

AtlantaPepsi Sep 27, 2022 Author

AtlantaPepsi Sep 27, 2022 Author

AtlantaPepsi Oct 4, 2022 Author

AtlantaPepsi Oct 4, 2022 Author

hzhou Oct 4, 2022 Maintainer

AtlantaPepsi
Sep 27, 2022

Replies: 5 comments 15 replies

hzhou
Sep 27, 2022
Maintainer

hzhou Sep 27, 2022
Maintainer

AtlantaPepsi Sep 27, 2022
Author

AtlantaPepsi May 12, 2023
Author

hzhou May 12, 2023
Maintainer

AtlantaPepsi May 12, 2023
Author

AtlantaPepsi
Sep 27, 2022
Author

hzhou
Sep 27, 2022
Maintainer

AtlantaPepsi Sep 27, 2022
Author

AtlantaPepsi Sep 27, 2022
Author

AtlantaPepsi
Oct 4, 2022
Author

AtlantaPepsi
Oct 4, 2022
Author

hzhou Oct 4, 2022
Maintainer