Aurora: Segfaults when message arrives via shm #7203

Open
abagusetty opened this issue Nov 8, 2024 · 19 comments · May be fixed by #7217 or #7223

abagusetty commented Nov 8, 2024

Thanks to @raffenet and @hzhou for figuring it out. Posting the reproducer created by @raffenet. Needs an Aurora label. Adding info from internal Slack:

I think we may have a recursive any_source cancel problem. When a message arrives via shm, we attempt to cancel the netmod partner request, but if that cancel fails we then try to cancel the shm partner? kaboom.

Also reproducible with upstream commits.
Backtrace from an app running with commit 204f8cd:

#0  0x00001519e8def6e3 in MPIDIG_mpi_cancel_recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#1  0x00001519e8def53d in MPIDI_OFI_recv_event () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#2  0x00001519e8ded4fa in MPIDI_OFI_progress_uninlined () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#3  0x00001519e8dbcf4e in MPIDI_NM_mpi_cancel_recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#4  0x00001519e8dba0c9 in MPIDIG_send_target_msg_cb () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#5  0x00001519e8d43bf4 in MPIDI_SHM_progress () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#6  0x00001519e8d4366a in MPIDI_progress_test () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#7  0x00001519e8d40a3e in MPIR_Wait () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#8  0x00001519e8b59282 in PMPI_Recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#9  0x00001519fac944f9 in _progress_server ()
    at /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/exachemdev_mpipr_11-01-2024/TAMM/build_2024.07.30.002-agama996.26-gitmpich/GlobalArrays_External-prefix/src/GlobalArrays_External/comex/src-mpi-pr/comex.c:3429
#10 0x00001519fac81a38 in _comex_init (comm=comm@entry=1140850688)

Reproducer:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

#define COUNT 4

int main(void) {
  MPI_Init(NULL, NULL);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int x = 1000000;
  while (x-- > 0) {
    int buf[COUNT] = {0};
    if (rank == 0) {
      MPI_Status status1, status2;
      MPI_Recv(buf, COUNT, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status1);
      MPI_Recv(buf, COUNT, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status2);
    } else {
      MPI_Send(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
  }

  MPI_Finalize();

  return 0;
}
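
(A typical way to build and run this, assuming MPICH's mpicc and Hydra's mpiexec: compile with mpicc repro.c -o repro, then launch at least three ranks across two nodes, e.g. mpiexec -n 3 -ppn 2 ./repro, so that rank 0 has both an on-node sender, whose message arrives via shm, and an off-node sender, whose message arrives via the netmod.)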
hzhou added the aurora label Nov 8, 2024

hzhou commented Nov 8, 2024

Thanks! @abagusetty

hzhou commented Nov 8, 2024

@abagusetty How did you get the backtrace? Do you have the location of the segfault?

abagusetty commented:

@hzhou The backtrace was generated from a core dump on Aurora; the app segfaulted only at large node counts. I can run the app with a debug build of MPICH to get a better backtrace.

hzhou commented Nov 8, 2024

@abagusetty Yeah, that will be helpful. I am curious which line segfaults.

raffenet commented Nov 8, 2024

Here's a full backtrace from a debug build of main. The request being canceled in frame #0 is the one that was matched in frame #12.

(gdb) bt
#0  0x0000147172408960 in MPIDIG_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/src/mpidig_recv.h:377
#1  0x00001471724097d5 in MPIDI_POSIX_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/shm/src/../posix/posix_recv.h:80
#2  0x000014717240885b in MPIDI_SHM_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/shm/src/shm_p2p.h:94
#3  0x0000147172407ae6 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
    is_cancelled=0x7ffc4845098c) at ./src/mpid/ch4/src/mpidig_request.h:130
#4  0x0000147172407453 in MPIDI_OFI_recv_event (vci=0, wc=0x7ffc48450a80,
    rreq=0x147172e84650 <MPIR_Request_direct+2480>, event_id=2)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:163
#5  0x000014717240719c in MPIDI_OFI_dispatch_optimized (vci=0, wc=0x7ffc48450a80,
    req=0x147172e84650 <MPIR_Request_direct+2480>) at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:205
#6  0x0000147172403a9b in MPIDI_OFI_handle_cq_entries (vci=0, wc=0x7ffc48450a50, num=2)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:61
#7  0x0000147172403273 in MPIDI_NM_progress (vci=0, made_progress=0x7ffc48450c08)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:105
#8  0x0000147172403047 in MPIDI_OFI_progress_uninlined (vci=0) at src/mpid/ch4/netmod/ofi/ofi_progress.c:13
#9  0x0000147172344321 in MPIDI_NM_mpi_cancel_recv (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
    is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:460
#10 0x0000147172343bd0 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e83e90 <MPIR_Request_direct+496>,
    is_cancelled=0x7ffc484510e8) at ./src/mpid/ch4/src/mpidig_request.h:108
#11 0x0000147172336be2 in match_posted_rreq (rank=1, tag=0, context_id=0, vci=0, is_local=true,
    req=0x7ffc48451158) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:225
#12 0x00001471723365f2 in MPIDIG_send_target_msg_cb (am_hdr=0x147157f168f0, data=0x147157f16920,
    in_data_sz=16, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:384
#13 0x00001471721e8219 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#14 0x00001471721e7eca in MPIDI_POSIX_progress (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#15 0x00001471721e7a68 in MPIDI_SHM_progress (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/shm_progress.h:18
#16 0x00001471721e6fbc in MPIDI_progress_test (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:142
#17 0x00001471721deafa in MPID_Progress_test (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:241
#18 0x00001471721e0525 in MPID_Progress_wait (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:296
#19 0x00001471721e0446 in MPIR_Wait_state (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
    status=0x7ffc4845175c, state=0x7ffc48451568) at src/mpi/request/request_impl.c:707
#20 0x00001471721e09ae in MPID_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
    status=0x7ffc4845175c) at ./src/mpid/ch4/src/ch4_wait.h:100
#21 0x00001471721e0868 in MPIR_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
--Type <RET> for more, q to quit, c to continue without paging--
    status=0x7ffc4845175c) at src/mpi/request/request_impl.c:750
#22 0x0000147171bbc58a in internal_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2,
    tag=0, comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:117
#23 0x0000147171bbb953 in PMPI_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2, tag=0,
    comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:169
#24 0x0000000000401d11 in main () at foo.c:20

hzhou commented Nov 8, 2024

I believe what is happening is: when shmem matches, it tries to cancel the netmod partner, but the netmod can't cancel if it has already matched, so it instead tries to cancel the shmem part.
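
As a standalone illustration, here is a toy C model of that cycle (made-up struct and function names, not actual MPICH internals):

#include <stdio.h>

/* Toy request with the cross-link between the shm and netmod halves of
 * an any_source receive. */
typedef struct request {
    struct request *anysrc_partner;
    int matched;                /* a message already matched this side */
    int freed;                  /* internal state already torn down */
} request;

static void cancel_recv(request *rreq);

static void try_cancel_partner(request *rreq)
{
    if (rreq->anysrc_partner)
        cancel_recv(rreq->anysrc_partner);
}

/* Canceling a side that has already matched fails, so the code falls
 * back to canceling *its* partner -- the request we started from. */
static void cancel_recv(request *rreq)
{
    if (rreq->freed) {
        printf("kaboom: dereferencing a freed request\n");
        return;
    }
    if (rreq->matched)
        try_cancel_partner(rreq);
}

int main(void)
{
    request shm = {0}, nm = {0};
    shm.anysrc_partner = &nm;
    nm.anysrc_partner = &shm;

    /* A message arrives via shm: the shm side matches and tears down
     * its internal state, then tries to cancel the netmod partner.
     * Progress inside that blocking cancel matches the netmod side
     * too, which bounces the cancel back into the freed shm request. */
    shm.matched = 1;
    shm.freed = 1;
    nm.matched = 1;
    try_cancel_partner(&shm);
    return 0;
}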

hzhou commented Nov 8, 2024

@raffenet Can you confirm that line 377 is

    if (!MPIR_Request_is_complete(rreq) &&
        !MPIR_STATUS_GET_CANCEL_BIT(rreq->status) && !MPIDIG_REQUEST_IN_PROGRESS(rreq))

?
If so, I suspect it segfaults in MPIDIG_REQUEST_IN_PROGRESS(rreq), because MPIDIG_REQUEST(rreq, req) was already freed, maybe in MPIDIG_send_target_msg_cb.

hzhou commented Nov 8, 2024

@raffenet If you remove that branch altogether -- so it leaks -- will the test run?

EDIT: I guess we need the shmem cancel to work. How about just setting the condition to true?

raffenet commented Nov 8, 2024

Yes, MPIDIG_REQUEST(rreq, req) is NULL according to the backtrace. I'll try removing the IN_PROGRESS check.

hzhou commented Nov 8, 2024

I guess it is somewhat of a recursive situation. In this block:

    if (is_blocking) {
        /* progress until the rreq completes, either with cancel-bit set,
         * or with message received */
        while (!MPIR_cc_is_complete(&rreq->cc)) {
            mpi_errno = MPIDI_OFI_progress_uninlined(vci);
            MPIR_ERR_CHECK(mpi_errno);
        }
    } else {
        /* run progress once to prevent accumulating cq errors. */
        mpi_errno = MPIDI_OFI_progress_uninlined(vci);
        MPIR_ERR_CHECK(mpi_errno);
    }
Maybe we should reset anysrc_partner before we call progress.
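
In the toy model from earlier in the thread, that idea looks roughly like this (a sketch only, same made-up names):

/* Clear both directions of the cross-link before canceling, so a match
 * that happens while progress spins cannot bounce the cancel back into
 * a request that is already being completed. */
static void try_cancel_partner_safe(request *rreq)
{
    request *partner = rreq->anysrc_partner;
    rreq->anysrc_partner = NULL;
    if (partner) {
        partner->anysrc_partner = NULL;     /* break the cycle */
        cancel_recv(partner);
    }
}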

raffenet commented Nov 8, 2024

I think we have to do it inside MPIDI_anysrc_try_cancel_partner. Once we have the partner request, we can unset its link back to the original request and then call cancel on it.

hzhou commented Nov 8, 2024

Give it a try? :)

raffenet commented Nov 8, 2024

I will. Lost my session, but this is my thought:

diff --git a/src/mpid/ch4/src/mpidig_request.h b/src/mpid/ch4/src/mpidig_request.h
index 8c2d374e..8e0f16fb 100644
--- a/src/mpid/ch4/src/mpidig_request.h
+++ b/src/mpid/ch4/src/mpidig_request.h
@@ -105,6 +105,8 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_anysrc_try_cancel_partner(MPIR_Request * rreq
                  * ref count here to prevent free since here we will check
                  * the request status */
                 MPIR_Request_add_ref(anysrc_partner);
+                /* unset the partner request's partner to prevent recursive cancelation */
+                anysrc_partner->dev.anysrc_partner = NULL;
                 mpi_errno = MPIDI_NM_mpi_cancel_recv(anysrc_partner, true);     /* blocking */
                 MPIR_ERR_CHECK(mpi_errno);
                 if (!MPIR_STATUS_GET_CANCEL_BIT(anysrc_partner->status)) {

raffenet commented Nov 9, 2024

The patch above just causes a deadlock at the first anysrc partner cancel operation 😦. I'll try doing it in the netmod layer before calling it a night.

raffenet added a commit to raffenet/mpich that referenced this issue Nov 14, 2024
Avoid recursively canceling partner requests for MPI_ANY_SOURCE recv
operations. Fixes pmodels#7203.
raffenet linked a pull request Nov 14, 2024 that will close this issue
raffenet added a commit to raffenet/mpich that referenced this issue Nov 17, 2024
Avoid recursively canceling partner requests for MPI_ANY_SOURCE recv
operations. Fixes pmodels#7203.

abagusetty commented:

@raffenet I just ran the app on 4k nodes of Aurora using your PR (built today) and hit a segfault with a slightly different backtrace than the one posted above:

#0  0x0000149604324ca3 in MPIDIG_mpi_cancel_recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#1  0x0000149604326f0b in MPIDI_OFI_handle_cq_entries () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#2  0x00001496043268bc in MPIDI_NM_progress () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#3  0x0000149604325656 in MPIDI_progress_test () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#4  0x000014960432293e in MPIR_Wait () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#5  0x0000149604145fe2 in PMPI_Recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#6  0x0000149617a5f4e9 in _progress_server ()
    at /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/exachemdev_mpipr_11-22-2024/TAMM/build_2024.07.30.002-agama996.26-gitmpich/GlobalArrays_External-prefix/src/GlobalArrays_External/comex/src-mpi-pr/comex.c:3429
#7  0x0000149617a4ca28 in _comex_init (comm=comm@entry=1140850688)

The failing API call on the app side is still the same, pointing to any_source usage.

abagusetty commented:

Not sure if I was prematurely testing the PR.

raffenet commented:

@abagusetty thanks for trying it out. I will see if I can reproduce and update the PR if I find the issue.

hzhou linked a pull request Nov 22, 2024 that will close this issue

hzhou commented Nov 22, 2024

@abagusetty Give #7223 a try.

abagusetty commented:

@hzhou @raffenet With #7223, it worked with the app. Tested at 4k and 8k nodes of Aurora. Thanks so much for tracking this down!
