Big latency between dynamic processes #6100
-
See the attached reproducer pingpong.c.txt. This example is a Send+Recv (PingPong) test between a pair of MPI processes, considering the following two scenarios: (1) the two processes are launched together and communicate within MPI_COMM_WORLD, and (2) a parent process spawns a child with MPI_Comm_spawn and the two communicate over the resulting intercommunicator.
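For reference, here is a minimal sketch of what the two scenarios look like (the attached pingpong.c.txt is the actual reproducer; the message size, iteration count, and lack of warm-up below are placeholders):

```c
/* Illustrative sketch only -- the attached pingpong.c.txt is the real
 * reproducer.  Scenario (1): "mpiexec -n 2 ./pingpong" ping-pongs between
 * the two ranks of MPI_COMM_WORLD.  Scenario (2): "mpiexec -n 1 ./pingpong"
 * spawns one child; parent and child ping-pong over the intercommunicator. */
#include <mpi.h>
#include <stdio.h>

#define NITER   1000
#define MSGSIZE 1024

static double pingpong(MPI_Comm comm, int peer, int initiator)
{
    char buf[MSGSIZE] = {0};
    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (initiator) {
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, 0, comm);
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, 0, comm);
        }
    }
    /* half the round-trip time, averaged over iterations */
    return (MPI_Wtime() - t0) / (2.0 * NITER);
}

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* child side of scenario (2): talk to parent rank 0 over the intercomm */
        pingpong(parent, 0, 0);
        MPI_Comm_disconnect(&parent);
    } else if (size == 1) {
        /* parent side of scenario (2): spawn one child of the same binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        double lat = pingpong(intercomm, 0, 1);
        printf("spawned:    %.2f us\n", lat * 1e6);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* scenario (1): two ranks inside MPI_COMM_WORLD */
        double lat = pingpong(MPI_COMM_WORLD, 1 - rank, rank == 0);
        if (rank == 0)
            printf("comm_world: %.2f us\n", lat * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```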
My workstation has the following CPU: AMD Ryzen Threadripper PRO 3995WX 64-Cores. This is what I get running MPICH v3.4.3 (ch3:nemesis) from Fedora packages:
This is what I get running the MPICH main branch (ch4:ofi) with no explicit optimization flags passed to configure:
A similar story on macOS with Homebrew MPICH (v4.0.2, ch4:ofi):
What's going on? Is such a large difference in performance expected? PS: We found this performance issue after running microbenchmarks for mpi4py.futures, which uses MPI_Comm_spawn to launch its worker processes. CC: @mrogowski
-
The communications with spawned processes go through the netmod, while COMM_WORLD uses shared memory for intra-node messages. The netmod performance depends on the libfabric provider, and the default sockets provider is not particularly optimized. I am not sure why the shared memory latency is so much higher in ch4 (18.63 vs 3.94). We'll investigate.
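As an aside (not specific to this thread), one way to see how much the provider matters is to force a different one before MPI_Init. A minimal sketch, assuming the ch4:ofi build honors libfabric's FI_PROVIDER environment variable (exporting it in the shell that runs mpiexec works too, and fi_info lists what is available):

```c
/* Sketch: select a libfabric provider for a quick comparison.  FI_PROVIDER
 * is read by libfabric when MPICH initializes the ofi netmod, so it has to
 * be set before MPI_Init (or exported in the launching shell, e.g.
 * FI_PROVIDER=tcp mpiexec -n 1 ./pingpong). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* "tcp" is only an example; the set of providers depends on how
     * libfabric was built. */
    setenv("FI_PROVIDER", "tcp", 1);
    MPI_Init(&argc, &argv);
    /* ... run the ping-pong as before ... */
    MPI_Finalize();
    return 0;
}
```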
-
That's what I was suspecting. Would it be too complicated for the parent and child processes to connect via the faster shared memory channel when they run on the same node? Dynamic process management has always been a sort of second-class citizen in the MPI world 😞.
-
I just did a manual build of
-
It would be much more complicated. If we want to promote dynamic processes to first class, we'll need to redesign the library, so maybe ch5. With every shift of focus in a redesign we gain something and lose something; the shared memory performance difference you are witnessing between ch3 and ch4 is of this nature. I think dynamic processes in HPC will always be second class. But maybe we can propose something in between: for example, reserve a comm-world(universe)-equivalent at launch/init, so we pave the necessary structure at init, and then a dynamic spawn at runtime can fit into that optimized structure.
-
@dalcinl Refer to my testing in #6072.
-
It may be interesting if you compare the
-
Is there any easy way to disable netmod polling to verify this (commenting out source code if needed)? Any pointers?
-
I compared ch4 from Homebrew with ch3 from conda-forge on Apple Silicon (M1), and ch4 is faster by a factor of 3 for 64 KiB messages. But take that with a grain of salt: Homebrew is built with -O3 and conda-forge with -O2, the compiler versions are not the same, etc. I should try manual builds of ch3 and ch4 with the same optimization flags and compiler.
-
I removed these lines in my testing: mpich/src/mpid/ch4/src/ch4_progress.h, lines 96 to 100 at commit 1c277e1.
-
That's interesting.
-
One thing to note is that on x86 some atomic operations can be as fast as non-atomic operations; that can be a very significant factor in shm (esp.
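To make that concrete (my own illustration, not code from MPICH): on x86-64, C11 acquire loads and release stores compile down to plain mov instructions, so an uncontended flag handoff like the sketch below costs about the same as non-atomic code; only stronger operations (seq_cst stores, atomic read-modify-writes) emit extra fences or locked instructions.

```c
/* Sketch: on x86-64 the release store and acquire load below compile to
 * plain mov instructions (check with: gcc -O2 -S flag.c), which is why
 * uncontended lock-free handoffs in shared memory can be nearly as cheap
 * as non-atomic code.  A seq_cst store, by contrast, emits xchg/mfence.
 * Build with: gcc -O2 -pthread flag.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag;   /* zero-initialized */
static int payload;

static void producer(int value)
{
    payload = value;                                        /* plain store */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* mov on x86  */
}

static int consumer(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire)) /* mov on x86 */
        ;                                                      /* spin */
    return payload;
}

static void *consumer_thread(void *arg)
{
    (void)arg;
    printf("consumer got %d\n", consumer());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer_thread, NULL);
    producer(42);
    pthread_join(t, NULL);
    return 0;
}
```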
-
@dalcinl Is it okay to close this issue?
-
Well, if you guys do not plan to take any immediate action on it, then feel free to close it.
-
I am afraid no immediate action can be taken. Let me convert this into a discussion, so it can accumulate priority while we collect ideas.