mpi4py>=4.0 prevents casampi.private.start_mpi calls from blocking #3

Open
r-xue opened this issue Oct 12, 2024 · 0 comments
r-xue commented Oct 12, 2024

Traditionally, the Python processes spawned from an mpicasa call split their roles when casampi.private.start_mpi is executed: rank 0 becomes the MPIclient, while non-rank 0 processes are placed in a holding pattern as MPIServers.
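
The traditional split described above can be sketched as a pure function of the MPI rank. This is only an illustration under stated assumptions, not casampi's implementation: `role_for_rank` is a hypothetical helper, and the `MPIServer-N` labels simply mirror the logger names that appear in the output further down.

```python
def role_for_rank(rank: int) -> str:
    """Hypothetical helper: map an MPI rank to its traditional role."""
    if rank < 0:
        raise ValueError("MPI ranks are non-negative")
    # Rank 0 drives the session as the client; every other rank is expected
    # to block in a server loop and wait for commands.
    return "MPIClient" if rank == 0 else f"MPIServer-{rank}"


if __name__ == "__main__":
    # Under mpicasa the rank would come from mpi4py, for example:
    #   from mpi4py import MPI; rank = MPI.COMM_WORLD.Get_rank()
    # Here the 8 ranks from the job map are enumerated without MPI.
    for rank in range(8):
        print(rank, role_for_rank(rank))
```

The symptom reported here is that ranks 1 and above return from `start_mpi` and carry on running the user script instead of staying in the server role.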

Builds that use conda-forge mpi4py>=4.0 with openmpi=5.0.4 appear to alter this behavior: non-rank 0 processes no longer assume their server roles after the casampi.private.start_mpi call. In both CASA 6.6.1 and 6.6.6, I have confirmed that the mpi4py version bump alone triggers the issue. The output below is from a run with 8 processes:

(pipe1669py38) rxue@xenon:~/Workspace/nvme/nrao/tickets/PIPE-1669/working$ casa6mpi_xvfb pipeline_flag -c ../scripts/test_working.py

======================   ALLOCATED NODES   ======================
    xenon: slots=1 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
        aliases: xenon
=================================================================

======================   ALLOCATED NODES   ======================
    xenon: slots=18 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: xenon
=================================================================

======================   ALLOCATED NODES   ======================
    xenon: slots=18 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: xenon
=================================================================

========================   JOB MAP   ========================
Data for JOB prterun-xenon-3573009@1 offset 0 Total slots allocated 18
    Mapping policy: BYCORE:OVERSUBSCRIBE  Ranking policy: FILL Binding policy: NONE
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: xenon    Num slots: 18   Max slots: 0    Num procs: 8
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 0 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 1 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 2 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 3 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 4 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 5 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 6 Bound: N/A
        Process jobid: prterun-xenon-3573009@1 App: 0 Process rank: 7 Bound: N/A

=============================================================

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py

Using configuration file ~/.casa/config.py
Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

Using user-supplied startup.py at ~/.casa/startup.py

No event loop hook running.
No event loop hook running.
No event loop hook running.
No event loop hook running.
No event loop hook running.
No event loop hook running.
No event loop hook running.
No event loop hook running.
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
/zfs/nvme/Workspace/nrao/tickets/PIPE-1669/working
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Start a pipeline processing session:
casalog.num_cpus:        36
casalog.total_memory:    131798098
casalog.omp_num_threads: 1
is_mpi_enabled:          False
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
openmpi version:         ('Open MPI', (5, 0, 4))
--------------------------------------------------------------------------------
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa  Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa  Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2
2024-10-12 20:16:52     INFO    ::casa
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa  Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
CASA 6.6.1.15 -- Common Astronomy Software Applications
CASA 6.6.1.15 -- Common Astronomy Software Applications
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3     Using configuration file ~/.casa/config.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3     Using user-supplied startup.py at ~/.casa/startup.py
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4     Checking Measures tables in data repository sub-directory /home/rxue/Workspace/nvme/nrao/casa_dist/casarundata/geodetic
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa    IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2       IERSeop2000 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa    IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa    IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2       IERSeop97 (version date, last date in table (UTC)): 2024/10/05/15:00, 2024/09/05/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-6       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-3       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-7       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-5       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-4       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa    TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-1       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2       IERSpredict (version date, last date in table (UTC)): 2024/10/11/15:00, 2025/01/09/00:00:00
2024-10-12 20:16:52     INFO    ::casa::MPIServer-2       TAI_UTC (version date, last date in table (UTC)): 2024/09/29/15:00, 2017/01/01/00:00:00
working? False
working? False
working? False
working? False
working? False
working? False
working? False
working? False
--------------------------------------------------------------------------

As an interim solution, we should continue using mpi4py<4 with openmpi=5.0.3.
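
A startup script could check that an environment still matches this interim pin before launching a parallel session. The following is a minimal sketch, assuming plain dotted version strings; `release_tuple` and `matches_interim_pin` are hypothetical helpers, not part of casampi or mpi4py.

```python
from itertools import takewhile


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '4.0.0rc1' -> (4, 0, 0)."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first piece without leading digits
        parts.append(int(digits))
    return tuple(parts)


def matches_interim_pin(mpi4py_version: str, openmpi_version: str) -> bool:
    """Return True when the versions satisfy mpi4py<4 and openmpi=5.0.3."""
    mpi4py_release = release_tuple(mpi4py_version)
    openmpi_release = release_tuple(openmpi_version)
    if not mpi4py_release:
        return False  # unparseable version: treat as not matching
    return mpi4py_release[0] < 4 and openmpi_release[:3] == (5, 0, 3)


if __name__ == "__main__":
    # At runtime the inputs could come from mpi4py itself, for example:
    #   import mpi4py; from mpi4py import MPI
    #   mpi4py.__version__ and MPI.get_vendor()
    # The literals below are illustrative values only.
    print(matches_interim_pin("3.1.6", "5.0.3"))
    print(matches_interim_pin("4.0.0", "5.0.4"))
```

Keeping the check free of an mpi4py import means it can run before MPI is initialized, which is an assumption about where such a guard would be most useful.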
