(0.7.0) Fix all the GPU bugs that have crept in #138

jagoosw · 2023-09-08T10:17:21Z

This PR fixes all of the GPU bugs that have crept in from not having CI on GPU. I've also modified all of the tests so that they run on GPU, and come up with a reasonably simple process for testing on Google Colab as a short-term solution.

Changes the ScaleNegativeTracers and SimpleMultiG APIs, and removes column_diffusion_timescale.

Closes #142 #143 #144

navidcy · 2023-09-11T09:52:03Z

@jagoosw, does it make sense to wait for #139 before this? That way we'll have CI on GPU...

jagoosw · 2023-09-11T10:26:00Z

@navidcy I'm not sure that #139 is going to work: the availability of GPU nodes for the tests to run on is so low that jobs take a really long time to start. John and I have talked about a solution to this, but I don't think it's going to be sorted in the short term.

For now I have tested these changes manually on a GPU.

navidcy · 2023-09-11T10:27:14Z

OK, so all tests pass on GPU in this PR? Good to know! Then I'll have a look at this soon. (Possibly after OSM abstract deadline..)

jagoosw · 2023-09-11T10:28:12Z

> OK, so all tests pass on GPU in this PR? Good to know! Then I'll have a look at this soon. (Possibly after OSM abstract deadline..)

I haven't run them all yet, only the sediment tests. I will let you know when I've run all of the tests.

codecov · 2023-09-11T16:32:48Z

Codecov Report

Attention: 63 lines in your changes are missing coverage. Please review.

Comparison is base (ac8419a) 66.04% compared to head (745501e) 63.00%.

❗ Current head 745501e differs from pull request most recent head 103c956. Consider uploading reports for the commit 103c956 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   66.04%   63.00%   -3.04%     
==========================================
  Files          27       28       +1     
  Lines        1066     1111      +45     
==========================================
- Hits          704      700       -4     
- Misses        362      411      +49

| Files | Coverage Δ | |
|---|---|---|
| src/Boundaries/gasexchange.jl | `87.27% <ø> (+1.06%)` | ⬆️ |
| src/Light/2band.jl | `88.57% <ø> (ø)` | |
| src/Models/AdvectedPopulations/NPZD.jl | `88.88% <100.00%> (+0.13%)` | ⬆️ |
| src/Utils/sinking_velocity_fields.jl | `74.19% <100.00%> (-0.81%)` | ⬇️ |
| src/Utils/timestep.jl | `0.00% <ø> (-81.82%)` | ⬇️ |
| ...c/Boundaries/Sediments/instant_remineralization.jl | `79.16% <83.33%> (+1.89%)` | ⬆️ |
| src/Boundaries/Sediments/simple_multi_G.jl | `94.11% <96.29%> (+3.73%)` | ⬆️ |
| src/BoxModel/boxmodel.jl | `2.17% <0.00%> (-0.05%)` | ⬇️ |
| src/Models/Individuals/SLatissima.jl | `83.87% <66.66%> (-0.11%)` | ⬇️ |
| src/OceanBioME.jl | `70.96% <33.33%> (-1.45%)` | ⬇️ |
... and 4 more


jagoosw · 2023-09-11T16:38:19Z

Okay, a lot of this is not working on GPU at the moment. I think we need CliMA/Oceananigans.jl#3262 at least to make the negativity protection work again (the previous implementation, which definitely worked on GPU, defined these after the model definition, so it wasn't a problem then, but it is now).

jagoosw · 2023-09-14T16:06:40Z

Okay, so now that I'm actually getting around to testing everything on GPU, I've discovered some more issues. The first is that scaling negative tracers can't pass tuples of symbols to kernels, as discussed in CliMA/Oceananigans.jl#3262. This is now solved.
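For anyone unfamiliar with the restriction: GPU kernel arguments generally have to be `isbits`, and a `Symbol` is not. The sketch below shows the check in plain Julia, plus one common workaround of moving the names into the type domain with `Val`. It is only an illustration: the tracer names and the `names_of` helper are made up here, and this is not necessarily how the issue was resolved in CliMA/Oceananigans.jl#3262 or in this PR.

```julia
# Tracer names chosen for illustration only.
tracer_names = (:N, :P, :Z, :D)

# Symbols are not bits types, so a tuple of them is not isbits either.
symbols_are_bits = isbits(tracer_names)

# Wrapping the tuple in `Val` lifts it into a type parameter; the resulting
# value has no fields, so it is trivially isbits.
wrapped = Val(tracer_names)
wrapped_is_bits = isbits(wrapped)

# Hypothetical helper: recover the names from the type parameter.
names_of(::Val{names}) where {names} = names

recovered = names_of(wrapped)
```

The trade-off with this kind of workaround is that every distinct set of names becomes a distinct type, so each one triggers its own kernel compilation.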

The more difficult issue is that models are now often too large for the GPU parameter space (see JuliaGPU/CUDA.jl#2080). I think this can be solved by only passing the relevant part of the model to kernels.
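To get a rough feel for the size problem, `sizeof` on an `isbits` struct reports the number of bytes it would occupy as a kernel argument. The sketch below is plain Julia with invented struct names and field counts, and the limit of roughly 4 KiB is an assumption used for illustration only (see JuliaGPU/CUDA.jl#2080 for the actual discussion).

```julia
# Invented stand-ins; the real model structs differ.
struct LightParameters
    coefficients::NTuple{512, Float64}   # pretend parameter table
end

struct CoreModel
    growth_rate::Float64
    mortality_rate::Float64
end

struct FullModel
    core::CoreModel
    light::LightParameters
end

# Assumed limit for illustration (about 4 KiB of kernel parameter space).
assumed_parameter_limit = 4096

full = FullModel(CoreModel(1.0, 0.1), LightParameters(ntuple(_ -> 0.0, 512)))

full_bytes = sizeof(full)        # everything, including the light parameters
core_bytes = sizeof(full.core)   # only the part a kernel would actually use

fits(bytes) = bytes <= assumed_parameter_limit
```

Here `fits(full_bytes)` is false while `fits(core_bytes)` is true, which is the motivation for passing kernels only the part of the model they need.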

My idea for this is to change adapt to return adapt(to, biogeochemistry.underlying_biogeochemistry) rather than the whole of biogeochemistry. All of the core Oceananigans uses of biogeochemistry in kernels only need what is packaged in underlying_biogeochemistry, and we discard the rest of it there anyway:

OceanBioME.jl/src/OceanBioME.jl

Lines 120 to 121 in ac8419a

    
    @inline biogeochemical_transition(i, j, k, grid, bgc::Biogeochemistry, val_tracer_name, clock, fields) =
        biogeochemical_transition(i, j, k, grid, bgc.underlying_biogeochemistry, val_tracer_name, clock, fields)

We can then ensure that in all other instances we only pass the relevant parts, e.g. here change from passing bgc to passing bgc.underlying_biogeochemistry:

OceanBioME.jl/src/Boundaries/Sediments/Sediments.jl

Lines 38 to 40 in ac8419a

    
    launch!(arch, model.grid, :xy,
            _calculate_tendencies!,
            sediment, bgc, model.grid, model.advection, model.tracers, model.timestepper)

I think this would solve most of our issues since, for example, the light attenuation model always seems to take up a lot of parameter space but is only actually used in update state.

This is similar to how Oceananigans never passes models to kernels and hence doesn't have an adapt_structure method for them (but here we would need one that just returns the underlying model, for all of the bits built into Oceananigans).
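A dependency-free sketch of the unwrapping idea, for illustration only. `kernel_view` is a hypothetical stand-in for an `adapt_structure` method, `transition` is a toy version of the forwarding shown in the snippet from src/OceanBioME.jl, and every type and field name other than `underlying_biogeochemistry` is invented.

```julia
# Invented stand-ins for the real types.
struct NPZDParameters
    growth_rate::Float64
    mortality_rate::Float64
end

struct Biogeochemistry{U, L, S}
    underlying_biogeochemistry::U
    light_attenuation::L
    sediment::S
end

# Stand-in for an `adapt_structure` method: by default pass values through...
kernel_view(x) = x
# ...but for the wrapper, hand kernels only the underlying model.
kernel_view(bgc::Biogeochemistry) = kernel_view(bgc.underlying_biogeochemistry)

# Toy per-point function: the wrapper method just delegates to the
# underlying model, so the rest of the wrapper is never used.
transition(bgc::Biogeochemistry, P) = transition(bgc.underlying_biogeochemistry, P)
transition(p::NPZDParameters, P) = (p.growth_rate - p.mortality_rate) * P

bgc = Biogeochemistry(NPZDParameters(1.0, 0.25), ntuple(_ -> 0.0, 512), nothing)

slim = kernel_view(bgc)

same_result = transition(slim, 2.0) == transition(bgc, 2.0)
bytes_saved = sizeof(bgc) - sizeof(slim)
```

With Adapt.jl the equivalent would presumably be a method along the lines of `Adapt.adapt_structure(to, bgc::Biogeochemistry) = adapt(to, bgc.underlying_biogeochemistry)`, as described above. Because the per-point functions already forward to the underlying model, a kernel gets the same answer from the slimmed-down argument.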

jagoosw · 2023-09-14T16:59:23Z

> My idea for this is to change adapt to return adapt(to, biogeochemistry.underlying_biogeochemistry) rather than the whole of biogeochemistry. All of the core Oceananigans uses of biogeochemistry in kernels only need what is packaged in underlying_biogeochemistry, and we discard the rest of it there anyway:

This looks like it's working! I'm getting different issues now, at least.

jagoosw · 2023-09-20T16:22:41Z

After propagating the changes from JuliaGPU/CUDA.jl#2080, all of the tests now pass on GPU! I guess we need to wait for CUDA to release a patch, and then I will bump the compatibility on Oceananigans. After that this should work.

jagoosw · 2023-09-20T16:33:34Z

Btw I'm trying to do GPU tests here: https://colab.research.google.com/drive/1V4Kj2iTtxjsU44c5CkAmJjjQKbqR3tYd?usp=sharing

Update on this: AWS SageMaker has much higher GPU availability if you just keep clicking start project, and much more transparent daily usage limits. You can also just run a terminal in it.

[skip ci]

jagoosw · 2023-09-20T22:16:07Z

All tests now pass on GPU

jagoosw · 2023-09-23T21:45:12Z

I'm fairly sure that the test coverage isn't actually reducing in this PR, and that it's just fallbacks plus Codecov not dealing with kernels very well.

So I think this PR is ready now

johnryantaylor

lgtm, thanks!

jagoosw added 8 commits September 7, 2023 17:12

fixed gpu incompatability in calculate_bottom_indices

6583027

fixed adapt method for instant remineralisation (oops)

20524a9

and again

cebc1fa

maybe fixed sediment

13b0405

maybe fixed sediment again

29514bf

Merge remote-tracking branch 'origin' into jsw/gpu-sediment-bug

450b6a9

maybe fixed

d9a210a

fixed multi g

a67c79e

jagoosw marked this pull request as ready for review September 9, 2023 18:26

jagoosw requested a review from navidcy September 9, 2023 18:26

navidcy added the GPU label Sep 11, 2023

jagoosw added 3 commits September 11, 2023 15:50

added GPU detection to tests

125cf0a

oops

ce514c0

fixed test labelling

67f9365

jagoosw added 2 commits September 14, 2023 13:18

maybe fixed scaling

ff0e343

tidying

3c8edb6

jagoosw added 3 commits September 14, 2023 17:36

idea

e0d4e7d

particles fix

b32f4fd

maybe fixed

df20a2f

jagoosw added 3 commits September 14, 2023 18:01

a fix

5d4944f

kelp please work

6b40a92

oops wrong model

834d18c

updated broken tests

21622ff

jagoosw added 12 commits September 20, 2023 17:38

typo

d3f0882

sign error

2351d64

fixed sediment docs

b8e1af3

removed unneccesary Array(interior(...))

2cdee7b

flattened sinking velocities

bc6bb3e

reduced complexity of sediment models further

9ccda06

typo

e9f9206

another mistake

36d2741

[skip ci]

typo

70cb036

[skip ci]

tidy up

13314f0

minor issue with instant remin

aecd5b7

minor issue with instant remin

3b953dd

jagoosw added 4 commits September 21, 2023 11:46

updated model implementation page

2dfb211

typo

3ddd947

Remove redundant fallbacks

6b7b3fb

generalised sinking tracers for NPZD

cd99299

jagoosw mentioned this pull request Sep 24, 2023

Improve Eady figure #145

Merged

bump oceananigans compat entry

745501e

jagoosw mentioned this pull request Sep 25, 2023

JOSS paper #76

Merged

5 tasks

jagoosw requested review from glwagner and johnryantaylor October 1, 2023 18:49

removed accidentally committed new files

103c956

johnryantaylor approved these changes Oct 3, 2023


jagoosw merged commit 4cf6565 into main Oct 3, 2023
2 checks passed

jagoosw deleted the jsw/gpu-sediment-bug branch October 3, 2023 14:30
