Add function for recursively printing parameter memory #2560

Open · wants to merge 1 commit into master

Conversation

charleskawczynski (Contributor)

As Clima has developed increasingly complex models and fused increasingly complex broadcast expressions, we've been running into parameter memory issues more frequently.

One issue I have with the existing printed message is that it does not provide granularity for large objects.

This PR implements a recursive print function/macro, @rprint_parameter_memory(some_object), that users can call (and build tooling around) to print parameter memory usage with fine granularity. For example, here is output from a tentative integration in MultiBroadcastFusion:

fmb
size: 72, fmb.pairs::Tuple{…}
size: 16, fmb.pairs.1::Pair{…}
size: 64, fmb.pairs.1.first::CUDA.CuArray{…}
size: 16, fmb.pairs.1.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.first.dims::NTuple{…}
size: 64, fmb.pairs.1.second::Base.Broadcast.Broadcasted{…}
size: 64, fmb.pairs.1.second.args::Tuple{…}
size: 64, fmb.pairs.1.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.1.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.1.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.1.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.1.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.1.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.1.second.args.1.dims::NTuple{…}
size: 24, fmb.pairs.2::Pair{…}
size: 64, fmb.pairs.2.first::CUDA.CuArray{…}
size: 16, fmb.pairs.2.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.first.dims::NTuple{…}
size: 128, fmb.pairs.2.second::Base.Broadcast.Broadcasted{…}
size: 128, fmb.pairs.2.second.args::Tuple{…}
size: 64, fmb.pairs.2.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.2.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.2.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.2.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.2.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.2.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.2.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.2.second.args.2.dims::NTuple{…}
size: 32, fmb.pairs.3::Pair{…}
size: 64, fmb.pairs.3.first::CUDA.CuArray{…}
size: 16, fmb.pairs.3.first.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.first.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.first.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.first.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.first.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.first.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.first.dims::NTuple{…}
size: 192, fmb.pairs.3.second::Base.Broadcast.Broadcasted{…}
size: 192, fmb.pairs.3.second.args::Tuple{…}
size: 64, fmb.pairs.3.second.args.1::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.1.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.1.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.1.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.1.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.1.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.1.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.2::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.2.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.2.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.2.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.2.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.2.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.2.dims::NTuple{…}
size: 64, fmb.pairs.3.second.args.3::CUDA.CuArray{…}
size: 16, fmb.pairs.3.second.args.3.data::GPUArrays.DataRef{…}
size: 24, fmb.pairs.3.second.args.3.data.rc::GPUArrays.RefCounted{…}
size: 64, fmb.pairs.3.second.args.3.data.rc.obj::CUDA.Managed{…}
size: 48, fmb.pairs.3.second.args.3.data.rc.obj.mem::CUDA.DeviceMemory
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.mem.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.data.rc.obj.stream::CUDA.CuStream
size: 16, fmb.pairs.3.second.args.3.data.rc.obj.stream.ctx::CUDA.CuContext
size: 40, fmb.pairs.3.second.args.3.dims::NTuple{…}
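
For reference, here is a minimal sketch of what such a recursive printer could look like. It is illustrative only, not this PR's implementation: it walks an object's fields and reports sizeof at each level, skipping primitive leaves the way the output above does. In practice one would apply it to the converted (device-side) arguments (e.g., after cudaconvert), since those determine the parameter footprint; note also that sizeof on an array instance reports the data size rather than the container size, so a real implementation needs more care there.

```julia
# Illustrative sketch only (not this PR's actual implementation).
function rprint_parameter_memory(io::IO, obj, name::AbstractString)
    println(io, "size: ", sizeof(obj), ", ", name, "::", typeof(obj))
    for (i, fname) in enumerate(fieldnames(typeof(obj)))
        child = getfield(obj, i)
        isprimitivetype(typeof(child)) && continue  # skip Int/Float leaves, as in the output above
        # fieldnames returns integers for tuples, so labels like `.1`, `.2` come out naturally
        rprint_parameter_memory(io, child, string(name, ".", fname))
    end
end
rprint_parameter_memory(obj, name::AbstractString="obj") =
    rprint_parameter_memory(stdout, obj, name)
```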

I'm cc-ing some people who may also be interested in this: @glwagner @simonbyrne @simone-silvestri

charleskawczynski (Contributor, Author)

I'm open to changing the format, but I do think this format is fairly simple and explicit.

maleadt (Member) commented Dec 3, 2024

This feels like a very niche feature that I'm not sure is worth putting in CUDA.jl. We already report the size of each argument; can't you keep the functionality for analyzing individual arguments in your own package, or even in a dedicated package for this purpose? I'd rather expose some way for you to perform that analysis on the actual kernel arguments (e.g., by saving them in the error that's thrown by @cuda).

maleadt added the speculative, enhancement, and cuda kernels labels on Dec 3, 2024.
glwagner (Contributor) commented Dec 3, 2024

> (e.g., by saving them in the error that's thrown by @cuda).

Whatever feature is implemented to help solve parameter space problems, a key criterion should be that it does not require launching a kernel to do the debugging. It's much more efficient to design kernel arguments by direct inspection than by trial-and-error kernel launching and digging through stack traces, which is the main issue with the current workflow.

maleadt (Member) commented Dec 3, 2024

> It's much more efficient to design kernel arguments by direct inspection, rather than by trial-and-error kernel launching

I fail to see what's more convenient about doing @rprint_parameter_memory(some_object) (after you have somehow decided that the kernel will fail to launch) as opposed to a try ... catch surrounding a call to @cuda and prying the arguments from that error (thereby relying on the source of truth as to whether the kernel would run or not).

Can you elaborate on the workflow you want? I'm proposing here that you would be able to catch a KernelError containing all arguments, for you to call @rprint_parameter_memory (or whatever tool you maintain locally) on, instead of CUDA.jl potentially generating a relatively inscrutable (at least to most users) infodump.
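
As a sketch of that proposed workflow (hypothetical: today's error does not carry the converted arguments, and the error type and field below are placeholders for whatever CUDA.jl would end up exposing):

```julia
using CUDA

# Hypothetical sketch of the proposed try/catch workflow; `err.args` does not exist
# today, and exposing something like it is exactly what is being proposed here.
function launch_or_analyze(kernel, args...)
    try
        @cuda kernel(args...)
    catch err
        err isa CUDA.KernelError || rethrow()  # error type as named in the proposal
        # User-side analysis, outside CUDA.jl: break down the converted arguments
        # with this PR's printer (or whatever tool you maintain locally).
        rprint_parameter_memory(err.args)      # hypothetical field
        rethrow()
    end
end
```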

glwagner (Contributor) commented Dec 3, 2024

The workflow is:

  1. Discover a parameter space error by attempting to run some program. These programs can be complex; for example, reaching the desired kernel may require constructing intermediate objects that themselves involve computation. A typical time to reach the error can be 10 or even 20 minutes.

  2. Inspect the error message, which helpfully prints the parameter space usage of the kernel arguments. After this one can take one of two actions: (a) split the kernel into components so that each component requires fewer arguments, or (b) somehow simplify the objects being passed to the kernel.

Executing on 2a doesn't really require any new features; we can simply do the arithmetic to figure out whether the kernels will succeed.
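
Concretely, the arithmetic for (a) is just comparing the reported per-argument sizes against the architecture's parameter space limit. A toy version, with made-up sizes and the roughly 32 KiB sm_89 / PTX 8.5 limit quoted later in this thread:

```julia
# Toy back-of-the-envelope check for option (a); the sizes are made up and would
# come from the error message (or from a printer like the one in this PR).
arg_sizes = (65_552, 48, 112)        # bytes per converted kernel argument
limit     = 32 * 1024                # bytes; depends on architecture / PTX version
fits      = sum(arg_sizes) <= limit  # false here, so this kernel would need splitting
```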

For 2b, we may have to change adapt_structure or, deeper down, experiment with changing the objects themselves. For example, one change we are tempted to try is allowing OffsetArray with offsets that are Int8 (or, more generally, a variable integer width). Predicting the potential savings of that change may be difficult, because some objects are built from a mixture of components, including many OffsetArrays. Therefore, to test whether such deep changes will succeed, we have to recompile and run our MWE, and since the MWE takes 10-20 minutes, this is slow. On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate a bit more quickly.
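
As a toy illustration of the kind of saving being weighed here (the struct below is made up, not an actual Oceananigans/ClimaOcean type), the parameter space cost of an offset container depends directly on the integer width of its offsets:

```julia
# Made-up stand-in for an offset array as it would appear in parameter space:
# a raw data handle plus per-dimension offsets.
struct ToyOffsetArray{T, N, I}
    data::T                # stand-in for the device-side array handle
    offsets::NTuple{N, I}  # per-dimension offsets
end

wide   = ToyOffsetArray(Ptr{Float64}(0), (Int64(1), Int64(2), Int64(3)))
narrow = ToyOffsetArray(Ptr{Float64}(0), (Int8(1), Int8(2), Int8(3)))

sizeof(typeof(wide))    # 32 bytes: 8 (pointer) + 3 * 8 (Int64 offsets)
sizeof(typeof(narrow))  # 16 bytes: 8 (pointer) + 3 * 1 (Int8 offsets), padded to alignment
```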

Here's an example conversation where we are trying to deduce how to solve a parameter space problem:

CliMA/ClimaOcean.jl#116

maleadt (Member) commented Dec 4, 2024

> On the other hand, if we could simply print the parameter space usage of some large object that we are making changes to, we could iterate a bit more quickly.

I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

FWIW, you should be able to make this all type-based, by just inspecting the device-side types that CUDA.jl already reports:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13),))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}} uses 64.000 KiB

With a recursive size printer that operates on Tuple{NTuple{8192, UInt64}}, you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).
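
A rough sketch of such a purely type-based printer (the name is made up; it is not part of CUDA.jl or this PR, and it assumes concrete isbits field types, which kernel parameters are):

```julia
# Illustrative sketch: break down a converted argument type field by field, using
# only the type printed in the error message; no instance and no cudaconvert needed.
function rprint_type_memory(io::IO, ::Type{T}, name::AbstractString="args") where {T}
    println(io, "size: ", sizeof(T), ", ", name, "::", T)
    for (fname, FT) in zip(fieldnames(T), fieldtypes(T))
        isprimitivetype(FT) && continue  # skip Int/Float/pointer leaves
        rprint_type_memory(io, FT, string(name, ".", fname))
    end
end
rprint_type_memory(T::Type, name::AbstractString="args") =
    rprint_type_memory(stdout, T, name)

rprint_type_memory(Tuple{NTuple{8192, UInt64}})
# size: 65536, args::Tuple{NTuple{8192, UInt64}}
# size: 65536, args.1::NTuple{8192, UInt64}
```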

glwagner (Contributor) commented Dec 4, 2024

> I see, so this is strictly a development utility that doesn't actually need any support in CUDA.jl?

Correct; I think the only point of putting it in CUDA.jl is to make it more visible and keep it up to date with CUDA development. It could easily find a home elsewhere. It doesn't even really need to be packaged, since I don't see much scope for further development; offering it in a package is merely an attempt to be friendly to other developers.

> you don't even need to call into any CUDA.jl internals (i.e., no calls to cudaconvert).

I'm not sure I understand, though @charleskawczynski might... the point of cudaconvert is to isolate the objects that actually get passed into the kernel (e.g., after being passed through adapt_structure), right?

maleadt (Member) commented Dec 5, 2024

> the point of cudaconvert is to isolate the objects that actually get passed into the kernel (e.g., after being passed through adapt_structure)

Yes, but in the error message that's reported by CUDA.jl you already get to see the types of the converted arguments:

julia> @cuda Returns(nothing)((ntuple(_->UInt64(1), 2^13), CUDA.rand(1)))
ERROR: Kernel invocation uses too much parameter memory.
64.047 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.5.

Relevant parameters:
  [1] args::Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}} uses 64.031 KiB

So your helper could ingest the Type{Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}}} and break it down exactly as is done in the OP. That would make the utility fully generic.
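
Continuing the type-based sketch from above (rprint_type_memory is a made-up helper name, not a CUDA.jl API), ingesting that reported type would be a one-liner:

```julia
using CUDA  # only so CuDeviceVector resolves; the helper itself never touches CUDA.jl internals

# Break down the converted argument type exactly as reported in the error message above.
rprint_type_memory(Tuple{NTuple{8192, UInt64}, CuDeviceVector{Float32, 1}})
```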

maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163, on December 20, 2024.