Add self hosted runner to get GPU testing on CSD3 #139

Closed
wants to merge 22 commits into from

jagoosw
Collaborator

@jagoosw jagoosw commented Sep 8, 2023

GitHub recently changed how self-hosted runners work, so I can now run one on CSD3. This is a very hacky approach: the tests submit a slurm script, wait for it to finish, and then check whether it succeeded.

I'm going to have a look and see whether it is cleaner, and not too much harder, to set up buildkite like Oceananigans uses. I've realised that the Oceananigans buildkite agents are individual computers that don't use slurm, so we will have to wait for that to be possible before we use buildkite.

We're also going to have the problem that the GPU nodes on CSD3 are always busy at the moment so it takes a long time for the jobs to start.

To do:

  • make it get a clean copy of the PR/repo at the start of the job
  • make it wait for the slurm job to finish
  • make it save the output of the tests so we can know what is wrong
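The submit, wait, and check steps above could be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the helper names (`parse_job_id`, `normalise_state`, `is_finished`, `submit_and_wait`) and the polling interval are hypothetical, and it assumes `sbatch` and `sacct` are available on the runner and that job accounting is enabled on the cluster.

```julia
# Hypothetical helpers, not taken from this PR: submit a Slurm job from a CI
# step, poll until it reaches a final state, and report whether it succeeded.

# Slurm job states after which the job will not run any further.
const TERMINAL_STATES = ("COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
                         "OUT_OF_MEMORY", "NODE_FAIL", "BOOT_FAIL",
                         "DEADLINE", "PREEMPTED")

# `sbatch --parsable` prints "<jobid>" or "<jobid>;<cluster>".
parse_job_id(output::AbstractString) = String(first(split(strip(output), ';')))

# `sacct` can append detail to a state, e.g. "CANCELLED by 1234"; keep the first word.
# Empty output (accounting not yet updated) maps to an empty state.
function normalise_state(raw::AbstractString)
    words = split(strip(raw))
    return isempty(words) ? "" : String(first(words))
end

is_finished(state::AbstractString) = state in TERMINAL_STATES

# Submit `script`, then poll `sacct` every `poll_seconds` until the job is done.
# Returns true only if the job ended in the COMPLETED state.
function submit_and_wait(script::AbstractString; poll_seconds = 60)
    job_id = parse_job_id(read(`sbatch --parsable $script`, String))
    while true
        # -X restricts the report to the job allocation itself, not its steps.
        raw = read(`sacct -j $job_id -X --noheader --parsable2 --format=State`, String)
        state = normalise_state(raw)
        is_finished(state) && return state == "COMPLETED"
        sleep(poll_seconds)
    end
end
```

A CI step could then call `submit_and_wait("gpu_tests.sh") || exit(1)` (the script name is a placeholder). For the last to-do item, one option is to have the Slurm script write to a known log file via an `#SBATCH --output` directive and print that file once the job finishes. Since the GPU nodes are busy, a real version would probably also want an overall timeout.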

@navidcy
Collaborator

navidcy commented Sep 11, 2023

Is there something wrong here? How is architecture defined? I'd expect something like:

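using Test, CUDA, Oceananigans, OceanBioME

# GPU() and CPU() are the Oceananigans architecture types
architectures = CUDA.has_cuda() ? tuple(GPU()) : tuple(CPU())

for arch in architectures
    test_this_and_that(arch) # placeholder for the actual test functions
end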

Then we only need a single runtests.jl?

@jagoosw
Collaborator Author

jagoosw commented Sep 14, 2023

Yeah, agreed, this PR is definitely not the best way to do these things.

Given the problems with getting GPU jobs to complete, I've had another temporary idea for GPU testing: I am running the tests in a Google Colab notebook. I might close this PR, as I've also made the useful changes in #138.

@jagoosw jagoosw closed this Sep 15, 2023
@glwagner
Collaborator

If you want to run tests like Oceananigans, you could invest a bit of money in a local server with a GPU and run all your tests there via buildkite.

@jagoosw
Collaborator Author

jagoosw commented Sep 16, 2023

That's the solution we're going for!

@jagoosw jagoosw deleted the jsw/gpu-testing branch March 27, 2024 22:16