Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(0.7.0) Fix all the GPU bugs that have crept in #138

Merged
merged 113 commits into from
Oct 3, 2023
Merged

Conversation

jagoosw
Copy link
Collaborator

@jagoosw jagoosw commented Sep 8, 2023

This PR fixes all of the GPU bugs that have crept in from not having CI on GPU. I've also modified all of the tests to run fine on GPU, and come up with a not too difficult process for testing on google colab as a short term solution.

Changed the ScaleNegativeTracers and SimpleMultiG APIs, and removes column_diffusion_timescale

Closes #142 #143 #144

@jagoosw jagoosw marked this pull request as ready for review September 9, 2023 18:26
@jagoosw jagoosw requested a review from navidcy September 9, 2023 18:26
@navidcy
Copy link
Collaborator

navidcy commented Sep 11, 2023

@jagoosw, does it make sense to wait for #139 before this? That way we'll have CI on GPU...

@navidcy navidcy added the GPU label Sep 11, 2023
@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 11, 2023

@navidcy I'm not sure that #139 is going to work because the availability of GPU nodes for the tests to run on is so low it's taking a really long time for jobs to start. John and I have talked about a solution to this but I don't think it's going to be sorted in the short term.

For now I have tested these changes manually on a GPU.

@navidcy
Copy link
Collaborator

navidcy commented Sep 11, 2023

OK, so all tests pass on GPU in this PR? Good to know! Then I'll have a look at this soon. (Possibly after OSM abstract deadline..)

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 11, 2023

OK, so all tests pass on GPU in this PR? Good to know! Then I'll have a look at this soon. (Possibly after OSM abstract deadline..)

I haven't done them all but tested the sediment, I will let you know when I've run all of the tests.

@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

Attention: 63 lines in your changes are missing coverage. Please review.

Comparison is base (ac8419a) 66.04% compared to head (745501e) 63.00%.

❗ Current head 745501e differs from pull request most recent head 103c956. Consider uploading reports for the commit 103c956 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   66.04%   63.00%   -3.04%     
==========================================
  Files          27       28       +1     
  Lines        1066     1111      +45     
==========================================
- Hits          704      700       -4     
- Misses        362      411      +49     
Files Coverage Δ
src/Boundaries/gasexchange.jl 87.27% <ø> (+1.06%) ⬆️
src/Light/2band.jl 88.57% <ø> (ø)
src/Models/AdvectedPopulations/NPZD.jl 88.88% <100.00%> (+0.13%) ⬆️
src/Utils/sinking_velocity_fields.jl 74.19% <100.00%> (-0.81%) ⬇️
src/Utils/timestep.jl 0.00% <ø> (-81.82%) ⬇️
...c/Boundaries/Sediments/instant_remineralization.jl 79.16% <83.33%> (+1.89%) ⬆️
src/Boundaries/Sediments/simple_multi_G.jl 94.11% <96.29%> (+3.73%) ⬆️
src/BoxModel/boxmodel.jl 2.17% <0.00%> (-0.05%) ⬇️
src/Models/Individuals/SLatissima.jl 83.87% <66.66%> (-0.11%) ⬇️
src/OceanBioME.jl 70.96% <33.33%> (-1.45%) ⬇️
... and 4 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 11, 2023

Okay a lot of the stuff is not working on GPU now, we need this CliMA/Oceananigans.jl#3262 at least to make the negativity protection work again I think (the previous implementation that definitely worked on GPU defined these after the model definition so it wasn't a problem, but is now).

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 14, 2023

Okay so now I'm actually getting around to testing everything on GPU I've discovered some more issues. The first of these is that scaling negative tracers can't pass tuples of symbols to kernels as discussed here CliMA/Oceananigans.jl#3262, this is now solved.

The more difficult issue is that models are now often too large for the GPU parameter space (see JuliaGPU/CUDA.jl#2080), I think this can be solved by only passing the relevant part of the model to kernels.

My idea for this is to change adapt to return adapt(to, biogeochemistry.underlying_biogeochemistry) rather than the whole of biogeochemistry since all of the core Oceananigans uses of biogeochemistry in kernels only needs what is packaged in underlying biogeochemistry given that we discard the rest of it there anyway:

@inline biogeochemical_transition(i, j, k, grid, bgc::Biogeochemistry, val_tracer_name, clock, fields) =
biogeochemical_transition(i, j, k, grid, bgc.underlying_biogeochemistry, val_tracer_name, clock, fields)

We can then ensure that in all other instances we only pass the relevant parts, e.g. here change from passing bgc to passing bgc.underlying_biogeochemistry:
launch!(arch, model.grid, :xy,
_calculate_tendencies!,
sediment, bgc, model.grid, model.advection, model.tracers, model.timestepper)

I think this would solve most of our issues as e.g. the light attenuation model seems to always be a lot of parameter space but only actually be used in update state.

This is similar to how Oceananigans never passed models to kernels and hence doesn't have an adapt_structure method for it (but here we would need to have one that just returns the underlying model for all of the bits built into oceananigans).

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 14, 2023

My idea for this is to change adapt to return adapt(to, biogeochemistry.underlying_biogeochemistry) rather than the whole of biogeochemistry since all of the core Oceananigans uses of biogeochemistry in kernels only needs what is packaged in underlying biogeochemistry given that we discard the rest of it there anyway:

This looks like it's working! Getting different issues now at least

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 20, 2023

After propagating the changes from JuliaGPU/CUDA.jl#2080 all of the tests now pass on GPU! I guess we need to wait for CUDA to release a patch and then I will bump the compatibility on Oceananigans, then this should work.

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 20, 2023

Btw I'm trying todo GPU tests here: https://colab.research.google.com/drive/1V4Kj2iTtxjsU44c5CkAmJjjQKbqR3tYd?usp=sharing

Update on this, AWS Sagemake has much higher GPU availability if you just keep clicking start project, and much more transparent daily usage limits + you can just run a terminal in it

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 20, 2023

All tests now pass on GPU

@jagoosw
Copy link
Collaborator Author

jagoosw commented Sep 23, 2023

I'm fairly sure that the test coverage isn't actually reducing in this PR and its just fallbacks + CodeCov not dealing with Kernels very well.

So I think this PR is ready now

@jagoosw jagoosw mentioned this pull request Sep 24, 2023
@jagoosw jagoosw mentioned this pull request Sep 25, 2023
5 tasks
Copy link
Collaborator

@johnryantaylor johnryantaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@jagoosw jagoosw merged commit 4cf6565 into main Oct 3, 2023
2 checks passed
@jagoosw jagoosw deleted the jsw/gpu-sediment-bug branch October 3, 2023 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

column_diffusion_timescale does not work on GPU
4 participants