Update Buildkite version on Tartarus and Sverdrup #3751

Open
ali-ramadhan opened this issue Aug 30, 2024 · 4 comments
Labels
testing 🧪 Tests get priority in case of emergency evacuation

Comments

@ali-ramadhan
Member

ali-ramadhan commented Aug 30, 2024

Maybe this has already been discussed extensively, but the GPU test suites on Buildkite fail often and have to be restarted manually for each PR.

The obvious solution is a bigger machine for testing, but I have two suggestions that are much easier to implement:

  1. Updating Buildkite. Newer versions may be more stable. The latest version is 3.79 but Sverdrup is on v3.24.0 (almost 4 years old) and Tartarus is on v3.50.4.
  2. If builds are failing due to too much resource competition, reducing the number of Buildkite agents on Sverdrup may help. Right now there are 16. I wonder if GPU builds will be more stable with 8-12. Some builds may be slower but if no one has to restart a test suite then that would make for a better developer experience.
ali-ramadhan added the "testing 🧪 Tests get priority in case of emergency evacuation" label on Aug 30, 2024
@glwagner
Member

We think there is a race condition in the CI. This is partly discussed in #3661 and #3662, although one conclusion is that we should update to use the Buildkite plugin (started in #3042).

@ali-ramadhan
Member Author

Makes sense. I saw enough different errors that I couldn't pinpoint a particular issue. Feel free to close this issue if it's a duplicate of #3661.

@navidcy
Collaborator

navidcy commented Aug 30, 2024

I see your struggles, @ali-ramadhan! I've been struggling with this too... I usually just restart the Buildkite build and hope for the best...

@navidcy
Collaborator

navidcy commented Aug 30, 2024

(often the CPU Buildkite build acts up...)

ali-ramadhan changed the title from "GPU test suites on Buildkite are flaky and fail often" to "Update Buildkite version on Tartarus and Sverdrup" on Sep 3, 2024
3 participants