(0.7.0) Fix all the GPU bugs that have crept in #138
Conversation
@navidcy I'm not sure that #139 is going to work because the availability of GPU nodes for the tests to run on is so low that it's taking a really long time for jobs to start. John and I have talked about a solution to this, but I don't think it's going to be sorted in the short term. For now I have tested these changes manually on a GPU.
OK, so all tests pass on GPU in this PR? Good to know! Then I'll have a look at this soon. (Possibly after the OSM abstract deadline.)
I haven't run them all yet, but I've tested the sediment. I will let you know when I've run all of the tests.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   66.04%   63.00%    -3.04%
==========================================
  Files          27       28        +1
  Lines        1066     1111       +45
==========================================
- Hits          704      700        -4
- Misses        362      411       +49

☔ View full report in Codecov by Sentry.
Okay, a lot of this is not working on GPU at the moment. We need CliMA/Oceananigans.jl#3262 at least to make the negativity protection work again, I think (the previous implementation, which definitely worked on GPU, defined these after the model definition so it wasn't a problem then, but it is now).
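(For reference, the negativity protection discussed here is essentially a conservative rescaling of the tracer set; below is a rough standalone sketch of the idea, not the actual OceanBioME implementation.)

```julia
# Toy version of the "scale negative tracers" idea: zero out negative values
# and rescale the positive ones so the summed tracer content is conserved.
# Plain-array sketch only; the real implementation works on Oceananigans
# fields inside a kernel.
function scale_negative!(values)
    total = sum(values)
    positive_total = sum(v for v in values if v > 0)
    for i in eachindex(values)
        values[i] = values[i] > 0 ? values[i] * total / positive_total : zero(values[i])
    end
    return values
end

scale_negative!([1.0, 2.0, -0.5])   # -> [0.8333…, 1.6667…, 0.0], sum still 2.5
```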
Okay, so now that I'm actually getting around to testing everything on GPU, I've discovered some more issues. The first of these is that scaling negative tracers can't pass tuples of symbols to kernels, as discussed in CliMA/Oceananigans.jl#3262; this is now solved. The more difficult issue is that models are now often too large for the GPU kernel parameter space (see JuliaGPU/CUDA.jl#2080). I think this can be solved by only passing the relevant part of the model to kernels. My idea for this is to change `adapt` so that it returns only the relevant parts (OceanBioME.jl/src/OceanBioME.jl, lines 120 to 121 at ac8419a).
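Roughly what I have in mind, as a minimal sketch with an illustrative wrapper type (the field names here are assumptions, not the exact OceanBioME definitions):

```julia
using Adapt

# Illustrative wrapper type; the real OceanBioME struct also carries
# particles, modifiers, etc.
struct CompleteBiogeochemistry{B, L, S}
    underlying_biogeochemistry :: B
    light_attenuation_model :: L
    sediment_model :: S
end

# Only the underlying biogeochemistry ever needs to reach a kernel, so the
# adapted object drops the large light attenuation and sediment models and
# keeps the kernel argument size within the CUDA parameter-space limit.
Adapt.adapt_structure(to, bgc::CompleteBiogeochemistry) =
    adapt(to, bgc.underlying_biogeochemistry)

bgc = CompleteBiogeochemistry((; growth_rate = 0.1), zeros(1000), zeros(1000))
adapt(Array, bgc)   # -> just (growth_rate = 0.1,)
```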
We can then ensure that in all other instances we only pass the relevant parts, e.g. here changing from passing `bgc` to passing `bgc.underlying_biogeochemistry` (OceanBioME.jl/src/Boundaries/Sediments/Sediments.jl, lines 38 to 40 at ac8419a).
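A self-contained toy illustrating the same change with KernelAbstractions (the types and the kernel here are placeholders, not the actual Sediments.jl code):

```julia
using KernelAbstractions

# Placeholder types, purely for illustration.
struct TinyBGC
    growth_rate :: Float64
end

struct BigWrapper{B, L}
    underlying_biogeochemistry :: B
    light_attenuation_model :: L   # big, and never needed inside the kernel
end

@kernel function grow!(tracer, bgc::TinyBGC, Δt)
    i = @index(Global)
    @inbounds tracer[i] += Δt * bgc.growth_rate * tracer[i]
end

tracer  = ones(16)
wrapper = BigWrapper(TinyBGC(0.1), zeros(10_000))  # pretend the light model is large

backend = CPU()
# Pass only the relevant part of the wrapper to the kernel, not `wrapper` itself:
grow!(backend, 16)(tracer, wrapper.underlying_biogeochemistry, 0.5; ndrange = length(tracer))
KernelAbstractions.synchronize(backend)
```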
I think this would solve most of our issues, as e.g. the light attenuation model seems to always take up a lot of parameter space but is only actually used in `update_state!`. This is similar to how Oceananigans never passes models to kernels and hence doesn't have an `adapt` method for them.
This looks like it's working! Getting different issues now, at least.
After propagating the changes from JuliaGPU/CUDA.jl#2080, all of the tests now pass on GPU! I guess we need to wait for CUDA to release a patch, and then I will bump the compatibility on Oceananigans; then this should work.
Update on this: AWS SageMaker has much higher GPU availability if you just keep clicking "start project", much more transparent daily usage limits, and you can just run a terminal in it.
All tests now pass on GPU.
I'm fairly sure that the test coverage isn't actually reducing in this PR, and it's just fallbacks plus Codecov not dealing with kernels very well. So I think this PR is ready now.
lgtm, thanks!
This PR fixes all of the GPU bugs that have crept in from not having CI on GPU. I've also modified all of the tests so they run fine on GPU, and come up with a not-too-difficult process for testing on Google Colab as a short-term solution.
Changes the `ScaleNegativeTracers` and `SimpleMultiG` APIs, and removes `column_diffusion_timescale`.
Closes #142, #143, #144