failing tests in ESPResSo v4.2.1 due to timeouts #363

Open
boegel opened this issue Oct 11, 2023 · 6 comments
@boegel
Contributor

boegel commented Oct 11, 2023

Some tests in the ESPResSo v4.2.1 test suite are known to be flaky, and sometimes hang, for example:

We ran into a similar problem when building ESPResSo v4.2.1 for EESSI pilot 2023.06, cf. #331.

@boegel
Contributor Author

boegel commented Oct 11, 2023

As mentioned in #331 (comment), it seems that hitting hanging tests is more likely on aarch64/neoverse_v1, since we didn't see hanging tests when building for the other CPU targets, but that could be dumb luck...

@jngrad Does this happen to ring any bells for you? Are you seeing hanging tests more often on certain platforms?

@boegel
Contributor Author

boegel commented Oct 11, 2023

I've added a hook to #331 to ignore the failing tests in ESPResSo v4.2.1 if they occur. That's the best we can do for now (other than not running the test suite at all, which is not a good idea imho). I've also updated the list of known issues to include this tracker issue, so we can get ESPResSo deployed in EESSI 2023.06...
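
For readers unfamiliar with EasyBuild hooks, such a hook might look roughly like the sketch below. This is an illustration only, not the code from #331. It assumes EasyBuild's `pre_test_hook` naming convention and the `testopts` easyconfig parameter; `FakeEasyBlock` is a hypothetical stand-in for a real easyblock instance so that the snippet runs on its own.

```python
# Hypothetical sketch of an EasyBuild hook; NOT the actual code added in #331.
def pre_test_hook(self, *args, **kwargs):
    """Tolerate test-suite failures for ESPResSo v4.2.1 (assumed approach)."""
    if self.name == "ESPResSo" and self.version == "4.2.1":
        # Appending a shell fallback lets the test command exit with status 0
        # even when individual tests fail or time out.
        self.cfg["testopts"] = "|| echo 'ignoring failing tests'"


class FakeEasyBlock:
    """Stand-in for an EasyBuild easyblock instance, for illustration only."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.cfg = {"testopts": ""}


eb = FakeEasyBlock("ESPResSo", "4.2.1")
pre_test_hook(eb)
print(eb.cfg["testopts"])
```

The trade-off of this approach is that genuine regressions are masked along with the flaky timeouts, which is why the test log still needs to be checked by hand.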

@jngrad

jngrad commented Oct 11, 2023

From my experience on Fedora Koji, our test cases aren't more prone to failure on neoverse than on x86_64. However, on architectures other than ARM and x86_64, we do see a lot of variability. For example, when packaging on openSUSE, we ended up disabling every architecture but x86_64. See openSUSE:Factory/python3-espressomd and click on "Show 17 excluded/disabled results" to see the list.

@jngrad

jngrad commented Oct 11, 2023

Here are all statistical tests: mass-and-rinertia_per_particle, rotational-diffusion-aniso, integrator_npt_stats.py, constant_pH_stats, langevin_thermostat_stats, brownian_dynamics_stats.py, dpd_stats, stokesian_dynamics.

They are known to take a large amount of time on our CI pipelines, because we run them concurrently and max out the host machine's CPU resources via MPI oversubscription, so that hyperthreaded cores are fully used. This makes their runtime fluctuate wildly, with a self-reinforcing feedback loop, since they compete against one another for the same resources (e.g. if one test times out, there is a very good chance another unrelated test will time out too). More details can be found in espressomd/espresso#3883.

Having said that, you don't seem to run these tests concurrently, so your CI pipelines should not be experiencing the issue I just described. Maybe there is a deeper issue in ESPResSo's MPI code. Unfortunately, timeout information alone is not sufficient for me to investigate an MPI issue.

@boegel
Contributor Author

boegel commented Jan 19, 2024

Interestingly, this problem did not pop up for the installation of ESPResSo v4.2.1 with foss/2023a in software.eessi.io, see #455 ...

@boegel
Contributor Author

boegel commented Jan 19, 2024

Interestingly, this problem did not pop up for the installation of ESPResSo v4.2.1 with foss/2023a in software.eessi.io, see #455 ...

Scratch that, that's incorrect. We have a hook in place to ignore failing tests on neoverse_v1, and we are still seeing timeouts (only for that CPU target).
For ESPResSo/4.2.1-foss-2023a:

The following tests FAILED:
          4 - test_checkpoint__therm_lb__p3m_cpu__lj__lb_cpu_ascii (Failed)
         34 - accumulator_correlator (Timeout)
         48 - interactions_bond_angle (Timeout)
         65 - rotation_per_particle (Timeout)
         66 - rotational_inertia (Timeout)
         71 - reaction_ensemble (Timeout)
         77 - canonical_ensemble (Timeout)
        100 - integrator_npt (Failed)
        101 - integrator_npt_stats (Failed)
        111 - lb_stats (Timeout)
        116 - dpd_stats (Timeout)
        124 - collision_detection (Timeout)
        151 - thermostats_anisotropic (Timeout)
        162 - lb_interpolation (Timeout)
        164 - oif_volume_conservation (Timeout)
        167 - lb_boundary (Timeout)
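
To triage a summary like the one above, it can help to separate timeouts from genuine failures. The snippet below is a minimal sketch that does this for a pasted excerpt of the list. The regular expression and the helper name `group_by_status` are assumptions made for illustration and are not part of ESPResSo, EasyBuild or CTest.

```python
import re
from collections import defaultdict

# Matches summary lines such as "  34 - accumulator_correlator (Timeout)".
# The pattern is an assumption based on the output shown in this issue.
CTEST_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def group_by_status(ctest_summary):
    """Group test names by the status reported in parentheses."""
    groups = defaultdict(list)
    for line in ctest_summary.splitlines():
        match = CTEST_LINE.match(line)
        if match:
            _number, name, status = match.groups()
            groups[status].append(name)
    return dict(groups)


# Excerpt of the list reported for ESPResSo/4.2.1-foss-2023a.
SUMMARY = """
The following tests FAILED:
          4 - test_checkpoint__therm_lb__p3m_cpu__lj__lb_cpu_ascii (Failed)
         34 - accumulator_correlator (Timeout)
        100 - integrator_npt (Failed)
        116 - dpd_stats (Timeout)
"""

for status, names in group_by_status(SUMMARY).items():
    print(f"{status}: {', '.join(names)}")
```

Applied to the full list, this kind of grouping shows that 13 of the 16 reported tests are timeouts, while only 3 are reported as `Failed`.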
