MPI_Reduce not really blocking when a process has failed #67
Comments
I forgot there is a drag-and-drop function here; here is the file:
Imagine an execution scenario where the process that is expected to die is extremely late, to the point where all other processes were able to send their contributions to the root (assume the reduction is implemented as a star, with all processes sending their data directly to the root). Thus, at the moment when a process had sent its contribution there was no known error, the send completed successfully, and the process was able to get out of the MPI_Reduce. You should be able to see the same behavior in [...]
Hi, this looks like normal behavior w.r.t. the outcome of the REDUCE call. By default you do not have a guarantee that MPI_REDUCE will produce the same error code at all ranks. Multiple mitigations can be applied here.
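A minimal sketch of one such mitigation (not necessarily the one the comment had in mind), assuming Open MPI's ULFM extensions are available, as they are in Open MPI 5.x: after the reduce, all ranks agree on a single verdict with MPIX_Comm_agree, so no rank uses the result while others see a failure.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI ULFM extensions: MPIX_Comm_agree */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(comm, &rank);

    int value = rank, sum = 0;
    int rc = MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, comm);

    /* Ranks may observe different rc values here: some can return MPI_SUCCESS
     * even though another process has already failed. Agree on one outcome. */
    int ok = (rc == MPI_SUCCESS);
    int arc = MPIX_Comm_agree(comm, &ok);   /* bitwise AND of 'ok' over surviving ranks */

    if (arc != MPI_SUCCESS || !ok) {
        /* Uniform decision: the reduction cannot be trusted at any rank.
         * Enter recovery (e.g., let Fenix roll back to the last checkpoint). */
        if (rank == 0) fprintf(stderr, "reduce outcome rejected, recovering\n");
    } else if (rank == 0) {
        printf("sum = %d\n", sum);
    }

    MPI_Finalize();
    return 0;
}
```

The agreement itself reports a process failure even when every surviving rank's reduce returned MPI_SUCCESS, which is what gives all ranks a consistent view. A cheaper but weaker alternative is to call another collective (e.g., MPI_Barrier) on the same communicator after the reduce so the failure notification propagates before the result is used, which matches the observation in the report below that adding a barrier hides the problem.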
Hello,
I am using "mpirun (Open MPI) 5.0.0rc6" with the latest Fenix version from the master branch.
MPI_Reduce lets some processes pass with the initial rank state and finish before the error starts being notified or handled, even though a process has already failed beforehand. This results in a wrong sum in the linked code if there is no barrier after the reduce.
This does not happen with MPI_Bcast & MPI_Allreduce.
If needed, here is a link to the code file in C on easyupload (it will expire after a couple of weeks):
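Since the upload will expire, here is a hypothetical minimal reproducer sketch of the pattern described (plain MPI with ULFM error handling, without the Fenix setup of the actual attachment; the simulated failure and names are illustrative only):

```c
#include <mpi.h>
#include <signal.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1) {
        raise(SIGKILL);   /* simulate a process failure before contributing */
    }

    int value = 1, sum = 0;
    int rc = MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Without a barrier (or agreement) after the reduce, surviving ranks can
     * reach this point with rc == MPI_SUCCESS and an incomplete sum, because
     * the failure of rank 1 has not yet been noticed everywhere. */
    if (rank == 0) {
        printf("rc = %d, sum = %d (expected %d)\n", rc, sum, size);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least three ranks and ULFM fault tolerance enabled at launch (e.g., something like mpirun --with-ft ulfm -np 4 ./reduce_test); adding an MPI_Barrier on the same communicator after the reduce should make the failure visible at all ranks before the sum is used.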