Error: OSC UCX component priority set inside component query failed #248

Open
cbehan opened this issue Jul 6, 2024 · 7 comments

@cbehan
Contributor

cbehan commented Jul 6, 2024

With pycftboot, it seems easy to generate XML files which need to be processed with pmp2sdp -f json instead of pmp2sdp -f bin. The simplest testcase I could make is attached. In json mode it works perfectly but in bin mode it results in:

/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = El::BigFloat; _Alloc = std::allocator<El::BigFloat>; reference = El::BigFloat&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
mySDP.xml.gz

@vasdommes self-assigned this Jul 8, 2024
@vasdommes
Collaborator

Hi @cbehan, could you describe in more detail how you call pmp2sdp and sdpb?

For me, the following commands gave the same result in both formats, maxComplementarity exceeded after 71 iterations (the numbers coincide up to rounding errors):

# binary format
./build/pmp2sdp --precision 1024 -i tmp/mySDP.xml -f bin -o tmp/bin/sdp
./build/sdpb --precision 1024 --sdpDir tmp/bin/sdp --outDir tmp/bin/out

# json format
./build/pmp2sdp --precision 1024 -i tmp/mySDP.xml -f json -o tmp/json/sdp
./build/sdpb --precision 1024 --sdpDir tmp/json/sdp --outDir tmp/json/out

There are two known reasons why the binary format can fail:

  1. pmp2sdp and sdpb are run with different --precision values. This should cause Assertion 'precision == El::gmp::Precision()' failed.
  2. pmp2sdp and sdpb are run in different environments (e.g. a different CPU architecture or Boost version). Note that our binary SDP format is built on top of the Boost.Serialization format, which is not portable.
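
One way to rule out the first cause is to pass a single precision value to both tools. The following is a minimal sketch, assuming the ./build paths and options from the commands above. The function name run_sdp and the RUN dry-run variable are hypothetical helpers and not part of SDPB. It does nothing about the second cause (a mismatched environment).

```shell
# Hypothetical helper: convert and solve with one shared --precision value,
# so pmp2sdp and sdpb cannot be given different precisions by accident.
# Set RUN=echo to print the two commands instead of executing them.
run_sdp() {
  local precision=$1 format=$2 input=$3 workdir=$4
  ${RUN:-} ./build/pmp2sdp --precision "$precision" -i "$input" -f "$format" -o "$workdir/sdp" &&
  ${RUN:-} ./build/sdpb --precision "$precision" --sdpDir "$workdir/sdp" --outDir "$workdir/out"
}
```

For example, after setting RUN=echo, calling run_sdp 1024 bin tmp/mySDP.xml tmp/bin prints the two commands of the binary-format run shown above without executing them.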

@cbehan
Contributor Author

cbehan commented Jul 9, 2024

Hi @vasdommes. It seems I spoke too soon. The difference between bin and json that I claimed is not reproducible. I think when I tried bin, I was also requesting more cores than when I tried json.

The attached output is now from a controlled test. With 2 cores, bin and json both work, but with 3 they both fail. There must be an issue when the number of blocks is too small to be split between all of the MPI processes. Since SDPB is designed for big problems (and it's easy to pass -n 1 when you have a small one), perhaps there is no need to fix this.
stream.txt

@vasdommes
Collaborator

@cbehan thanks for the updates!

Your example works fine on my machine with 3 cores.

Generally, SDPB should work fine when the number of cores exceeds the number of blocks (it should fail only in the extreme case when the number of nodes exceeds the number of blocks). So it's unclear what went wrong for you.

@vasdommes
Collaborator

Let's take a look at the error messages:

[connor-laptop:07950] osc_ucx_component.c:369  Error: OSC UCX component priority set inside component query failed 

This error repeats many times. It is raised somewhere inside the OpenMPI code.
I couldn't find more details online, except for the source code and one unanswered question mentioning MPI_Win_Allocate.

/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = El::BigFloat; _Alloc = std::allocator<El::BigFloat>; reference = El::BigFloat&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

Out-of-range error for std::vector<El::BigFloat>. Maybe data is already invalid due to MPI errors that happened earlier.

[connor-laptop:07949] *** Process received signal ***
[connor-laptop:07949] Signal: Aborted (6)
[connor-laptop:07949] Signal code:  (-6)
[connor-laptop:07949] [ 0] /usr/lib/libc.so.6(+0x3cae0)[0x7b836ea50ae0]
[connor-laptop:07949] [ 1] /usr/lib/libc.so.6(+0x94e44)[0x7b836eaa8e44]
[connor-laptop:07949] [ 2] /usr/lib/libc.so.6(gsignal+0x20)[0x7b836ea50a30]
[connor-laptop:07949] [ 3] /usr/lib/libc.so.6(abort+0xdf)[0x7b836ea384c3]
[connor-laptop:07949] [ 4] /usr/lib/libstdc++.so.6(_ZNSt6chrono3_V212system_clock3nowEv+0x0)[0x7b836ecd2d60]
[connor-laptop:07949] [ 5] /usr/lib/libEl.so.0(_ZN2El4copy21TranslateBetweenGridsINS_8BigFloatEEEvRKNS_10DistMatrixIT_LNS_6DistNS4DistE0ELS6_2ELNS_10DistWrapNS8DistWrapE0EEERS9_+0x572)[0x7b8373e73172]
[connor-laptop:07949] [ 6] /usr/lib/libEl.so.0(_ZN2El10DistMatrixINS_8BigFloatELNS_6DistNS4DistE0ELS3_2ELNS_10DistWrapNS8DistWrapE0EEaSERKS6_+0x14)[0x7b8373e754c4]
[connor-laptop:07949] [ 7] /usr/lib/libEl.so.0(_ZN2El16HermitianTridiagINS_8BigFloatEEEvNS_14UpperOrLowerNS12UpperOrLowerERNS_18AbstractDistMatrixIT_EES7_RKNS_20HermitianTridiagCtrlIS5_EE+0x387)[0x7b83744b8a97]
[connor-laptop:07949] [ 8] /usr/lib/libEl.so.0(_ZN2El12herm_tridiag17ExplicitCondensedINS_8BigFloatEEEvNS_14UpperOrLowerNS12UpperOrLowerERNS_18AbstractDistMatrixIT_EERKNS_20HermitianTridiagCtrlIS6_EE+0x55)[0x7b83744b8d85]
[connor-laptop:07949] [ 9] /usr/lib/libEl.so.0(_ZN2El8herm_eig8BlackBoxINS_8BigFloatEEENS_16HermitianEigInfoENS_14UpperOrLowerNS12UpperOrLowerERNS_18AbstractDistMatrixIT_EERNS6_INS_10BaseHelperIS7_E4typeEEERKNS_16HermitianEigCtrlIS7_EE+0x61a)[0x7b8374c1079a]
[connor-laptop:07949] [10] /usr/lib/libEl.so.0(_ZN2El12HermitianEigINS_8BigFloatEEENS_16HermitianEigInfoENS_14UpperOrLowerNS12UpperOrLowerERNS_18AbstractDistMatrixIT_EERNS5_INS_10BaseHelperIS6_E4typeEEERKNS_16HermitianEigCtrlIS6_EE+0x15d)[0x7b8374c14dcd]
[connor-laptop:07949] [11] /usr/bin/sdpb(+0x122ac3)[0x5b9a9ab2dac3]
[connor-laptop:07949] [12] /usr/bin/sdpb(+0x122555)[0x5b9a9ab2d555]
[connor-laptop:07949] [13] /usr/bin/sdpb(+0xf7e9e)[0x5b9a9ab02e9e]
[connor-laptop:07949] [14] /usr/bin/sdpb(+0xebdc4)[0x5b9a9aaf6dc4]
[connor-laptop:07949] [15] /usr/bin/sdpb(+0x7691c)[0x5b9a9aa8191c]
[connor-laptop:07949] [16] /usr/bin/sdpb(+0x568f3)[0x5b9a9aa618f3]
[connor-laptop:07949] [17] /usr/lib/libc.so.6(+0x25c88)[0x7b836ea39c88]
[connor-laptop:07949] [18] /usr/lib/libc.so.6(__libc_start_main+0x8c)[0x7b836ea39d4c]
[connor-laptop:07949] [19] /usr/bin/sdpb(+0x66a85)[0x5b9a9aa71a85]
[connor-laptop:07949] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 7949 on node connor-laptop exited on
signal 6 (Aborted).

This comes from this line in SDPB:

El::HermitianEig(El::UpperOrLowerNS::LOWER, block, eigenvalues,

It is interesting that SDPB did not crash immediately but did some nontrivial computations before Signal: Aborted (6) was raised.
Specifically, we reached at least this line (note that update_cond_numbers() above contains a synchronization point):

= step_length(X_cholesky, dX, parameters.step_length_reduction,

So, it looks like some OpenMPI configuration issue or an obscure Elemental bug.
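
The sdpb frames in the backtrace above ([11] through [16] and [19]) carry only offsets, not symbol names. As a rough sketch, the offsets could be pulled out of a saved error log and handed to addr2line from GNU binutils. The helper name frame_offsets is hypothetical, and the lookup is only useful if the installed sdpb binary was built with debug information, which is an assumption here.

```shell
# Hypothetical helper: read an MPI backtrace on stdin and print the offsets
# of the frames that belong to the given binary, one per line.
# A frame line looks like: [host:pid] [11] /usr/bin/sdpb(+0x122ac3)[0x5b9a...]
frame_offsets() {
  local binary=$1
  sed -n "s|.*${binary}(+\(0x[0-9a-f]*\)).*|\1|p"
}
```

With the backtrace saved in one of the attached files, the offsets could then be resolved with something like frame_offsets /usr/bin/sdpb < error_ucx.txt | xargs addr2line -C -f -e /usr/bin/sdpb, where -C demangles C++ names and -f prints the enclosing function.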

@vasdommes changed the title from "Drastically different results with bin vs json format" to "Error: OSC UCX component priority set inside component query failed" Jul 10, 2024
@vasdommes removed the io label Jul 10, 2024
@vasdommes added this to the Backlog milestone Jul 10, 2024
@vasdommes
Collaborator

If the problem is specific to the OSC UCX component in OpenMPI, using another OSC component might help.
See https://docs.open-mpi.org/en/main/mca.html#command-line-parameters

@cbehan could you try the following?

  1. Run SDPB with the OSC UCX component disabled:
mpirun --mca osc ^ucx -n 3 /usr/bin/sdpb --precision=1024 -s mySDP
  2. If this doesn't help, run ompi_info | grep osc to see all available OSC components:
$ ompi_info | grep osc
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.2)

and then try running SDPB with each OSC component (e.g. sm, pt2pt, monitoring, ucx, or rdma), for example:

mpirun --mca osc sm -n 3 /usr/bin/sdpb --precision=1024 -s mySDP

On my laptop, --mca osc sm works, whereas everything else (e.g. --mca osc ucx) fails with An error occurred in MPI_Win_allocate_shared.
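
The component-by-component check described above can be scripted. This is a minimal sketch, assuming ompi_info prints lines in the "MCA osc: name (...)" form shown above. The function names parse_osc_components and try_osc_components and the osc_NAME.log file names are hypothetical, while the mpirun options are the ones already used in this thread.

```shell
# Hypothetical helper: read `ompi_info` output on stdin and print the name of
# each OSC component, one per line.
parse_osc_components() {
  awk '$1 == "MCA" && $2 == "osc:" { print $3 }'
}

# Hypothetical helper: launch the given command once per available OSC
# component and report which components succeed.
# Usage: try_osc_components <number of processes> <command> [args...]
try_osc_components() {
  local nprocs=$1 comp
  shift
  for comp in $(ompi_info | parse_osc_components); do
    # Keep the full output of each attempt in its own log file.
    if mpirun --mca osc "$comp" -n "$nprocs" "$@" > "osc_${comp}.log" 2>&1; then
      echo "$comp: ok"
    else
      echo "$comp: FAILED (see osc_${comp}.log)"
    fi
  done
}
```

For the case in this issue the call would be try_osc_components 3 /usr/bin/sdpb --precision=1024 -s mySDP, which leaves one log file per component in the current directory.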

@cbehan
Contributor Author

cbehan commented Jul 15, 2024

Thanks for the tips. I have four of those five OSC components.

    ompi_info | grep osc
    MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.3)
    MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v5.0.3)
    MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.3)
    MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.3)

They all fail but in different ways and the error output is attached.
error_other.txt
error_sm.txt
error_ucx.txt

@vasdommes
Collaborator

@cbehan what OS are you using?

In an attempt to reproduce the error, I've installed OpenMPI 5.0.5 from source following the official instructions (with default parameters) on my WSL + Ubuntu 22.04, and rebuilt Elemental and SDPB with this version. But it works fine for me.

For reference, the installation process looked (more or less) like this:

cd $HOME
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5/
./configure --prefix=$HOME/install
make all
make install

cd $HOME/elemental
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/install -DCMAKE_CXX_COMPILER=$HOME/install/bin/mpicxx -DCMAKE_C_COMPILER=$HOME/install/bin/mpicc
make && make install

cd $HOME/sdpb
CXX=$HOME/install/bin/mpicxx ./waf configure --elemental-dir=$HOME/install --flint-dir=$HOME/install

$HOME/install/bin/mpirun -n 3 ./build/sdpb --precision 1024 --sdpDir tmp/bin/sdp --outDir tmp/bin/out --noFinalCheckpoint

OSC components available:

$ $HOME/install/bin/ompi_info | grep osc
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.5)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v5.0.5)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.5)

sm is the default and works fine; monitoring and rdma fail with An error occurred in MPI_Win_allocate_shared.
