Big latency between dynamic processes #6100
-
See the attached reproducer pingpong.c.txt. This example is a Send+Recv (PingPong) test between a pair of MPI processes, considering the following two scenarios: (1) the two processes are launched together and communicate within MPI_COMM_WORLD, and (2) a parent process spawns a child with MPI_Comm_spawn and the two communicate over the resulting intercommunicator.
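For reference, here is a minimal sketch of what the two scenarios look like (the attached pingpong.c.txt is the actual reproducer; the message size, iteration count, and lack of warm-up below are placeholders):

```c
/* Illustrative sketch only -- the attached pingpong.c.txt is the real
 * reproducer.  Scenario (1): "mpiexec -n 2 ./pingpong" ping-pongs between
 * the two ranks of MPI_COMM_WORLD.  Scenario (2): "mpiexec -n 1 ./pingpong"
 * spawns one child; parent and child ping-pong over the intercommunicator. */
#include <mpi.h>
#include <stdio.h>

#define NITER   1000
#define MSGSIZE 1024

static double pingpong(MPI_Comm comm, int peer, int initiator)
{
    char buf[MSGSIZE] = {0};
    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (initiator) {
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, 0, comm);
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, peer, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGSIZE, MPI_CHAR, peer, 0, comm);
        }
    }
    /* half the round-trip time, averaged over iterations */
    return (MPI_Wtime() - t0) / (2.0 * NITER);
}

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* child side of scenario (2): talk to parent rank 0 over the intercomm */
        pingpong(parent, 0, 0);
        MPI_Comm_disconnect(&parent);
    } else if (size == 1) {
        /* parent side of scenario (2): spawn one child of the same binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        double lat = pingpong(intercomm, 0, 1);
        printf("spawned:    %.2f us\n", lat * 1e6);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* scenario (1): two ranks inside MPI_COMM_WORLD */
        double lat = pingpong(MPI_COMM_WORLD, 1 - rank, rank == 0);
        if (rank == 0)
            printf("comm_world: %.2f us\n", lat * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```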
My workstation has the following CPU: AMD Ryzen Threadripper PRO 3995WX 64-Cores. This is what I get running MPICH v3.4.3 (ch3:nemesis) from Fedora packages:
This is what I get running the MPICH main branch (ch4:ofi) with no explicit optimization flags passed to configure:
A similar story on macOS with Homebrew MPICH (v4.0.2, ch4:ofi):
What's going on? Is such a large difference in performance expected? PS: We found this performance issue after running microbenchmarks for mpi4py.futures, which uses MPI_Comm_spawn to launch its worker processes. CC: @mrogowski
-
The communications with spawned processes go through the netmod, while COMM_WORLD uses shared memory for intra-node messages. The netmod performance depends on the libfabric provider, and the default sockets provider is not particularly optimized. I am not sure why the shared memory latency is so much higher in ch4 (18.63 vs 3.94). We'll investigate.
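As an aside (not specific to this thread), one way to see how much the provider matters is to force a different one before MPI_Init. A minimal sketch, assuming the ch4:ofi build honors libfabric's FI_PROVIDER environment variable (exporting it in the shell that runs mpiexec works too, and fi_info lists what is available):

```c
/* Sketch: select a libfabric provider for a quick comparison.  FI_PROVIDER
 * is read by libfabric when MPICH initializes the ofi netmod, so it has to
 * be set before MPI_Init (or exported in the launching shell, e.g.
 * FI_PROVIDER=tcp mpiexec -n 1 ./pingpong). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* "tcp" is only an example; the set of providers depends on how
     * libfabric was built. */
    setenv("FI_PROVIDER", "tcp", 1);
    MPI_Init(&argc, &argv);
    /* ... run the ping-pong as before ... */
    MPI_Finalize();
    return 0;
}
```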
-
That's what I was suspecting. Would it be too complicated for the parent and child processes to connect via the faster shared memory channel when they run on the same node? Dynamic process management has always been a sort of second-class citizen in the MPI world 😞.
-
I just did a manual build of
-
It would be much more complicated. If we want to promote dynamic processes to first class, we'll need to redesign the library, so maybe ch5. With every shift of focus in a redesign we gain something and lose something; the shared memory performance difference you are witnessing between ch3 and ch4 is of this nature. I think dynamic processes in HPC will always be second class. But maybe we can propose something in between: for example, reserve a comm-world(universe)-equivalent at launch/init, so we pave the necessary structure at init, and then a dynamic spawn at runtime can fit into that optimized structure.
-
@dalcinl Refer to my testing in #6072.
-
It may be interesting if you compare the
-
Is there any easy way to disable netmod polling to verify this (commenting out source code if needed)? Any pointers?
-
I compared ch4 from Homebrew with ch3 from conda-forge on Apple Silicon (M1), and ch4 is faster by a factor of 3 for 64 KiB messages. But take that with a grain of salt: Homebrew is built with -O3 and conda-forge with -O2, the compiler versions are not the same, etc. I should try manual builds of ch3 and ch4 with the same optimization flags and compiler.
-
I removed these lines in my testing: mpich/src/mpid/ch4/src/ch4_progress.h, lines 96 to 100 at commit 1c277e1.
-
That's interesting.
-
One thing to note is that on x86 some atomic operations can be as fast as non-atomic operations; that can be a very significant factor in shm (esp.
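To make that concrete (my own illustration, not code from MPICH): on x86-64, C11 acquire loads and release stores compile down to plain mov instructions, so an uncontended flag handoff like the sketch below costs about the same as non-atomic code; only stronger operations (seq_cst stores, atomic read-modify-writes) emit extra fences or locked instructions.

```c
/* Sketch: on x86-64 the release store and acquire load below compile to
 * plain mov instructions (check with: gcc -O2 -S flag.c), which is why
 * uncontended lock-free handoffs in shared memory can be nearly as cheap
 * as non-atomic code.  A seq_cst store, by contrast, emits xchg/mfence.
 * Build with: gcc -O2 -pthread flag.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag;   /* zero-initialized */
static int payload;

static void producer(int value)
{
    payload = value;                                        /* plain store */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* mov on x86  */
}

static int consumer(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire)) /* mov on x86 */
        ;                                                      /* spin */
    return payload;
}

static void *consumer_thread(void *arg)
{
    (void)arg;
    printf("consumer got %d\n", consumer());
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer_thread, NULL);
    producer(42);
    pthread_join(t, NULL);
    return 0;
}
```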
-
@dalcinl Is it okay to close this issue?
-
Well, if you guys do not plan to take any immediate action on it, then feel free to close it.
-
I am afraid no immediate action can be taken. Let me convert this into a discussion, so it can accumulate priority while we collect ideas.