Link to CUDA.jl 5.0 blog post. #38

Merged 2 commits on Sep 26, 2023
182 changes: 182 additions & 0 deletions post/2023-09-19-cuda_5.0.md
@@ -0,0 +1,182 @@
+++
title = "CUDA.jl 5.0: Integrated profiler and task synchronization changes"
author = "Tim Besard"
external = true
abstract = """
CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks
how tasks are synchronized. The release is slightly breaking, as it changes how local
toolkits are handled and raises the minimum Julia and CUDA versions."""
+++

{{abstract}}

{{ redirect "https://info.juliahub.com/cuda-jl-5-0-changes" }}


<!--

## Integrated profiler

The most exciting new feature in CUDA.jl 5.0 is [the new integrated
profiler](https://github.com/JuliaGPU/CUDA.jl/pull/2024), which is similar to the `@profile`
macro from the Julia standard library. The profiler can be used by simply prefixing any code
that uses the CUDA libraries with `CUDA.@profile`:

```julia-repl
julia> CUDA.@profile CUDA.rand(1).+1
Profiler ran for 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs (85.97% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│ 76.47% │ 205.28 µs │ 1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel │
│ 5.42% │ 14.54 µs │ 2 │ 7.27 µs │ 5.01 µs │ 9.54 µs │ cuMemAllocFromPoolAsync │
│ 2.93% │ 7.87 µs │ 1 │ 7.87 µs │ 7.87 µs │ 7.87 µs │ cuLaunchKernel │
│ 0.36% │ 953.67 ns │ 2 │ 476.84 ns │ 0.0 ns │ 953.67 ns │ cudaGetLastError │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs (0.80% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│ 0.44% │ 1.19 µs │ 1 │ 1.19 µs │ 1.19 µs │ 1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│ 0.36% │ 953.67 ns │ 1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
1 column omitted
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
1.7242923
```

The output shown above is a summary of what happened during the execution of the code. It
is split into two sections: **host-side activity**, i.e., API calls to the CUDA libraries,
and the resulting **device-side activity**. For each section, the output shows the time
spent and its ratio to the total execution time. These ratios are important: they offer a
quick way to assess the performance of your code. For example, in the output above, most
of the time is spent on the host calling the CUDA libraries, while very little time is
actually spent computing on the GPU. This indicates that the GPU is severely
underutilized, which can be solved by increasing the problem size.
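
As a quick illustration of that advice, the sketch below profiles the same broadcast on a
much larger array; the device-side share of the trace should grow accordingly (exact
numbers depend on your GPU):

```julia
using CUDA

# The same operation on a much larger input: the kernels now have real
# work to do, so the device-side percentage should increase substantially.
CUDA.@profile CUDA.rand(1024, 1024) .+ 1
```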

Instead of a summary, it is also possible to view a **chronological trace** by passing the
`trace=true` keyword argument:

```julia-repl
julia> CUDA.@profile trace=true CUDA.rand(1).+1;
Profiler ran for 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs (86.40% of the trace)
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │ Start │ Time │ Name │ Details │
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
│ 5 │ 6.44 µs │ 9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 7 │ 19.31 µs │ 715.26 ns │ cudaGetLastError │ - │
│ 8 │ 22.41 µs │ 204.09 µs │ cudaLaunchKernel │ - │
│ 9 │ 227.21 µs │ 0.0 ns │ cudaGetLastError │ - │
│ 14 │ 232.7 µs │ 3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │ 7.39 µs │ cuLaunchKernel │ - │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs (0.91% of the trace)
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Name ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
│ 8 │ 225.31 µs │ 1.19 µs │ 64 │ 64 │ 38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │ 1 │ 1 │ 18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
1 column omitted
```

Here, we can see a list of events that the profiler captured. Each event has a unique ID,
which can be used to correlate host-side and device-side events. For example, we can see
that event 8 on the host is a call to `cudaLaunchKernel`, which corresponds to the
execution of a CURAND kernel on the device.

The integrated profiler is a great tool to quickly assess the performance of your GPU
application, identify bottlenecks, and find opportunities for optimization. For complex
applications, however, it is still recommended to use NVIDIA's Nsight Systems or Nsight
Compute profilers, which provide a more detailed, graphical view of what is happening on
the GPU.


## Synchronization on worker threads

Another noteworthy change affects how tasks are synchronized. To enable concurrent
execution, i.e., to make it possible for other Julia tasks to execute while waiting for the
GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a
significant source of latency, at least 25 µs per invocation but sometimes *much* longer, and
have also been slated for deprecation and eventual removal from the CUDA toolkit.

Instead, on Julia 1.9 and later, CUDA.jl [now
uses](https://github.com/JuliaGPU/CUDA.jl/pull/2025) worker threads to wait for GPU
operations to finish. This mechanism is significantly faster, taking around 5 µs per
invocation, but more importantly it offers much more reliable and predictable latency. You
can observe this mechanism using the integrated profiler:

```julia-repl
julia> a = CUDA.rand(1024, 1024, 1024);

julia> CUDA.@profile trace=true CUDA.@sync a .+ a;
Profiler ran for 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms (95.64% of the trace)
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│ ID │ Start │ Time │ Thread │ Name │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│ 5 │ 6.91 µs │ 13.59 µs │ 1 │ cuMemAllocFromPoolAsync │
│ 9 │ 36.72 µs │ 199.56 µs │ 1 │ cuLaunchKernel │
│ 525 │ 510.69 µs │ 11.75 ms │ 2 │ cuStreamSynchronize │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘
```

For some users, this may still be too slow, so we have added two mechanisms that disable
nonblocking synchronization and simply block the calling thread until the GPU operation
finishes. The first is a global setting: the `nonblocking_synchronization` preference,
which can be set to `false` using Preferences.jl.
[The second](https://github.com/JuliaGPU/CUDA.jl/pull/2060) is a fine-grained flag to pass
to synchronization functions: `synchronize(x; blocking=true)`, `CUDA.@sync blocking=true
...`, etc. Both of these mechanisms should *not* be used widely; they are only intended
for latency-critical code, e.g., when benchmarking or profiling.
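
A sketch of both mechanisms, assuming a project that depends on CUDA.jl (the preference
only takes effect after restarting Julia):

```julia
using CUDA, Preferences

# Global: disable nonblocking synchronization for this project,
# persisted to LocalPreferences.toml.
set_preferences!(CUDA, "nonblocking_synchronization" => false)

# Fine-grained: block the calling thread for this operation only.
a = CUDA.rand(1024)
CUDA.@sync blocking=true a .+ a
```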


## Local toolkit discovery

One of the breaking changes involves [how local toolkits are
discovered](https://github.com/JuliaGPU/CUDA.jl/pull/2058), when opting out of the use of
artifacts. Previously, this could be enabled by calling
`CUDA.set_runtime_version!("local")`, which generated a `version = "local"` preference. We
are now changing this into two separate preferences, `version` and `local`, where the
`version` preference overrides the version of the CUDA toolkit, and the `local`
preference independently indicates whether to use a local CUDA toolkit or not.

Concretely, this means that you will now need to call
`CUDA.set_runtime_version!(local_toolkit=true)` to enable the use of a local toolkit. The
toolkit version will be auto-detected, but can be overridden by also passing a version:
`CUDA.set_runtime_version!(version; local_toolkit=true)`. This may be necessary when CUDA
is not available during precompilation, e.g., on the log-in node of a cluster, or when
building a container image.
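
For example (the version number below is purely illustrative):

```julia
using CUDA

# Opt out of artifacts and use the local CUDA toolkit,
# auto-detecting its version.
CUDA.set_runtime_version!(local_toolkit=true)

# When auto-detection is not possible, e.g., without a GPU during
# precompilation, also specify the toolkit version explicitly.
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)
```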


## Raised minimum requirements

Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version
is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit
version is now 11.4, but this cannot be enforced by the package manager. As a result, if you
need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or
below. [The README](https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md) will maintain
a table of supported CUDA toolkit versions.
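
If you do need an older toolkit, the pin can be applied through the package manager, along
these lines:

```julia
using Pkg

# Pin CUDA.jl to the 4.4 release, for use with CUDA toolkits older than 11.4.
Pkg.pin(name="CUDA", version="4.4")
```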

Most users will not be affected by this change: if you use the artifact-provided CUDA
toolkit, you will automatically get the latest version supported by your CUDA driver.


## Other changes

- [Support for CUDA 12.2](https://github.com/JuliaGPU/CUDA.jl/pull/2034);
- [Memory limits](https://github.com/JuliaGPU/CUDA.jl/pull/2040) are now enforced by CUDA,
resulting in better performance;
- [Support for Julia 1.10](https://github.com/JuliaGPU/CUDA.jl/pull/1946) (with help from
[@dkarrasch](https://github.com/dkarrasch));
- Support for batched [`gemm`](https://github.com/JuliaGPU/CUDA.jl/pull/1975),
  [`gemv`](https://github.com/JuliaGPU/CUDA.jl/pull/1981) and
  [`svd`](https://github.com/JuliaGPU/CUDA.jl/pull/2063) (by
  [@lpawela](https://github.com/lpawela) and [@nikopj](https://github.com/nikopj)); see
  the sketch below.
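
As a rough illustration of the batched interface, here is a sketch using the low-level
CUBLAS wrapper (the linked PRs extend batched support; exact entry points may differ):

```julia
using CUDA
using CUDA.CUBLAS: gemm_batched!

# Multiply 10 pairs of 4×4 matrices in a single batched CUBLAS call:
# C[i] = A[i] * B[i] for each i.
A = [CUDA.rand(Float32, 4, 4) for _ in 1:10]
B = [CUDA.rand(Float32, 4, 4) for _ in 1:10]
C = [CUDA.zeros(Float32, 4, 4) for _ in 1:10]
gemm_batched!('N', 'N', 1f0, A, B, 0f0, C)
```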

-->
24 changes: 23 additions & 1 deletion utils.jl
@@ -27,6 +27,27 @@ function getdate(fname)
return parse.(Int, (y, m, d))
end

function hfun_redirect(params)
url = params[1]

# XXX: is this safe?
#Franklin.set_var!(Franklin.LOCAL_VARS, "fd_full_url", url)

return """
<meta http-equiv="refresh" content="0;url=$url">

<!-- fallback 1: use JavaScript, in case the browser doesn't like meta tags in the body -->
<script>
window.onload = function() {
window.location.href = "$url";
}
</script>

<!-- fallback 2: provide a visual element, in case Javascript is disabled -->
<p>This blog post is located at <a href="$url">$url</a></p>
"""
end

function hfun_post_date()
# capture the RSS publication date from the file name
fd_url = locvar(:fd_url)::String
@@ -72,11 +93,12 @@ function blogpost_entry(fpath)
if hidden === true
return nothing
end
ext = something(pagevar(rpath, :external), false)
title = pagevar(rpath, :title)::String
y, m, d = getdate(fpath)
rpath = replace(fpath, r"\.md$" => "")
date = Date(y, m, d)
return (date, blogpost_entry_html("/post/$rpath/", title, y, m, d))
return (date, blogpost_entry_html("/post/$rpath/", title, y, m, d; ext))
end

function blogpost_external_entries()