Regarding CUDA-awareness #6197
Replies: 5 comments 15 replies
-
You observed correctly. In the current code, we are just focusing on getting the semantics right as a first stage. The current implementation is a basic fallback implementation, and bad performance is expected. We are looking at optimizations, which will depend on the synchronization methods available in the GPU runtime and the triggering capability of the NIC. Expect better performance in our future releases.
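For intuition, here is a conceptual sketch of what a host-staging fallback path for a device send buffer typically looks like. It illustrates the general pattern only, not MPICH's actual code; the helper name and buffer handling are made up.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Conceptual fallback: stage a device buffer through pinned host memory
 * before handing it to the regular (host-only) send path.
 * Illustrative only -- not MPICH's implementation. */
static int staged_send(const void *dev_buf, int count, MPI_Datatype dtype,
                       int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(dtype, &type_size);
    size_t bytes = (size_t) count * type_size;

    void *host_buf;
    cudaMallocHost(&host_buf, bytes);                  /* pinned staging buffer */
    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);

    int rc = MPI_Send(host_buf, count, dtype, dest, tag, comm);

    cudaFreeHost(host_buf);
    return rc;
}
```

The per-message pinned allocation and DtoH copy in a path like this are exactly the kind of overhead that shows up as `cudaMallocHost` and memcpy time in a profile.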
-
@hzhou I guess it's difficult to lock down the exact lines, but the receiver operation you described also under
-
The actual ipc code should be in
-
@hzhou a small update on this:
-
So the real question is: why does MPICH copy data back to the host for internode communication?
Any additional suggestions on why this could happen? I'll also attach the build scripts below. We were using the 4.1a1 tarball from the website, with COMPILER=nvidia, ARCH=x86-milan_nvidia80, NET=ofi:
common.sh:
and version.sh:
-
I have some doubts about what happens under the hood of CUDA-aware operations in MPICH; I'll use the following snippet as an example:
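(The original snippet is not reproduced here. As a stand-in, a minimal sketch of a stream-enqueued device-to-device send/recv pair, assuming the MPIX stream API in MPICH 4.1 (`MPIX_Stream_create`, `MPIX_Stream_comm_create`, `MPIX_Send_enqueue`, `MPIX_Recv_enqueue`), might look roughly like the following; the info keys used to attach the CUDA stream and the exact signatures may differ between MPICH versions.)

```c
#include <mpi.h>              /* MPICH declares the MPIX_ extensions in mpi.h */
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double *d_buf;
    cudaMalloc((void **) &d_buf, N * sizeof(double));

    /* Attach a CUDA stream to an MPIX stream; the "type"/"value" info keys
     * follow MPICH's stream examples but may vary by version. */
    cudaStream_t cu_stream;
    cudaStreamCreate(&cu_stream);
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &cu_stream, sizeof(cu_stream));

    MPIX_Stream stream;
    MPIX_Stream_create(info, &stream);
    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, stream, &stream_comm);

    /* Enqueue a device-to-device transfer onto the CUDA stream. */
    if (rank == 0)
        MPIX_Send_enqueue(d_buf, N, MPI_DOUBLE, 1, 0, stream_comm);
    else if (rank == 1)
        MPIX_Recv_enqueue(d_buf, N, MPI_DOUBLE, 0, 0, stream_comm, MPI_STATUS_IGNORE);

    cudaStreamSynchronize(cu_stream);

    MPI_Comm_free(&stream_comm);
    MPIX_Stream_free(&stream);
    MPI_Info_free(&info);
    cudaStreamDestroy(cu_stream);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```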
I inspected both regular `MPI_Isend`/`MPI_Irecv` and the stream-enqueued `MPIX_Isend/recv` variants.

First, for the stream-enqueued case: does `MPIX_Send_enqueue` between device addresses always allocate host memory and copy the data back into pinned host memory? It seems so from the profiler output (the purple part is the DtoH memcpy, and the `cudaMallocHost` calls account for the majority of the execution time), which led me to this part of the source. I think `MPIR_GPU_query_pointer_is_dev` checks whether a pointer is a device address, right?

I see a similar but different sequence of events for regular `MPI_Isend`/`MPI_Irecv`: the very first send or receive allocates host memory and does a `cudaMemcpyAsync`, but subsequent operations are very fast and involve no copy back (after the first long `MPI_Isend`, the next two are very short). I haven't found the corresponding source code yet; could you explain what is going on here, @hzhou? Why does the first `MPI_Isend` allocate host memory, and what data is it copying from device to host? If it is the actual data being transferred, why do subsequent calls not involve a copy back?
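As an aside, a device-pointer check of the kind `MPIR_GPU_query_pointer_is_dev` appears to perform can be done with `cudaPointerGetAttributes`. The sketch below is a generic illustration under that assumption, not MPICH's implementation:

```c
#include <cuda_runtime.h>
#include <stdbool.h>

/* Generic check: does ptr refer to device (or managed) memory?  This mirrors
 * the kind of query a CUDA-aware MPI library makes before deciding whether a
 * buffer needs host staging; it is not MPICH's MPIR_GPU_query_pointer_is_dev. */
static bool pointer_is_device(const void *ptr)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        cudaGetLastError();   /* clear the error and treat ptr as host memory */
        return false;
    }
    return attr.type == cudaMemoryTypeDevice || attr.type == cudaMemoryTypeManaged;
}
```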