Random failures due to jumbled output in TpetraCore_MatrixMarket_Tpetra_CrsMatrix_Dist_Binary_simple_MPI_1 breaking PR builds starting 2022-07-08 #10898

Open
bartlettroscoe opened this issue Aug 18, 2022 · 10 comments
Labels
DO_NOT_AUTOCLOSE This issue should be exempt from auto-closing by the GitHub Actions bot. impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

CC: @trilinos/tpetra, @tasmith4

Description

As shown in this query (click "Shown Matching Output" in upper right) the test:

  • TpetraCore_MatrixMarket_Tpetra_CrsMatrix_Dist_Binary_simple_MPI_1

is randomly failing in the builds:

  • PR-10706-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-380
  • PR-10751-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-877
  • PR-10808-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-929

starting on testing day 2022-07-08.

Just as for the Tpetra tests reported in issue #10885, these failures are caused by jumbled output breaking up the printing of End Result: TEST PASSED, as shown here:

End RKokkos::Cuda::Cuda instance constructor : ERROR device not initialized
Kokkos::Cuda::Cuda instance constructor : ERROR device not initialized
esult: TEST PASSED
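
To illustrate why interleaving alone is enough to fail the test, here is a minimal sketch. It assumes the pass/fail decision comes down to finding the contiguous string End Result: TEST PASSED in the captured output (for example through a pass regular expression in the test harness). The function name outputIndicatesPass is hypothetical and is not part of Trilinos, TriBITS, or CTest.

```cpp
#include <string>

// Hypothetical stand-in for the harness check (an assumption, not real
// Trilinos/TriBITS code): the run counts as passed only if the marker
// appears as one contiguous substring of the captured output.
bool outputIndicatesPass(const std::string& capturedOutput) {
  static const std::string marker = "End Result: TEST PASSED";
  return capturedOutput.find(marker) != std::string::npos;
}
```

Given the three jumbled lines above as input, this check returns false even though every character of the marker is present, because the Kokkos messages were written between "End R" and "esult: TEST PASSED".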

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

It is a randomly failing test so it will be hard to reproduce.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Tpetra impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Data Services Issues that fall under the Trilinos Data Services Product Area labels Aug 18, 2022
@bartlettroscoe
Member Author

FYI: This was the only test failure that took out the last iteration of my PR build #10808 (comment). I have been trying to get that PR to pass PR testing for going on 3 weeks now, and random Tpetra test failures have taken out several of those iterations.

@tasmith4
Contributor

@bartlettroscoe this is a little different from the last one, since it's not output deliberately printed by Tpetra. I've noticed it before on other projects as well, but I'm not exactly sure what the root cause is. I'm reaching out to the Kokkos team for more information on this.

@tasmith4
Contributor

@bartlettroscoe from my conversation on the Kokkos Slack, it sounds like this is actually a Kokkos bug, which was resolved in kokkos/kokkos#5151. This fix will be available in Kokkos 3.7. I'll leave it up to you whether it's better to wait for Kokkos 3.7 to make it into Trilinos or pull the fix over now.

@bartlettroscoe
Member Author

> I'll leave it up to you whether it's better to wait for Kokkos 3.7 to make it into Trilinos or pull the fix over now.

@tasmith4, I think it can wait for the Kokkos upgrade.

However, it would be good to know how many Tpetra tests are failing due to jumbled output. It occurred to me how to search for that, and I think this query does it, showing:

image

So between this issue and #10885, I think that catches them all.

@csiefer2
Member

FYI - Trilinos PR for Kokkos/KokkosKernels update is supposed to get put in this week (as per Nathan).

@bartlettroscoe
Member Author

@tasmith4, @csiefer2, what might help is to carefully flush the streams before and after printing End Result: TEST PASSED. If you are only outputting from one MPI rank, then that may eliminate the jumbled output problem.

@tasmith4
Contributor

@bartlettroscoe I think for most if not all tests we just write to the Teuchos unit test "out" stream, and a lot of that stuff gets handled however the Teuchos unit testing framework/command line options specify (I've never dug super deep into that).

@bartlettroscoe
Member Author

> @bartlettroscoe I think for most if not all tests we just write to the Teuchos unit test "out" stream, and a lot of that stuff gets handled however the Teuchos unit testing framework/command line options specify (I've never dug super deep into that).

Right, but that is just a stream. Perhaps we should create a function in TriBITS called Tribits:printEndResultTestPassed() that will do the proper flushing and only print on the root process?

@tasmith4
Contributor

> Perhaps we should create a function in TriBITS called Tribits:printEndResultTestPassed() that will do the proper flushing and only print on the root process?

I could go for that. Could be a lot of work to retrofit existing tests though.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Aug 19, 2023
@csiefer2 csiefer2 added DO_NOT_AUTOCLOSE This issue should be exempt from auto-closing by the GitHub Actions bot. and removed MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. labels Sep 12, 2023
@jhux2 jhux2 added this to Tpetra Aug 12, 2024
@jhux2 jhux2 moved this to Backlog in Tpetra Aug 12, 2024