Asynchronous concurrent execution
*******************************************************************************

Asynchronous concurrent execution is important for efficient parallelism and
resource utilization, with techniques such as overlapping computation and data
transfer, managing concurrent kernel execution with streams on single or
multiple devices, or using HIP graphs.

Streams and concurrent execution
===============================================================================

All asynchronous APIs, such as kernel execution, data movement and potentially
data allocation/freeing, happen in the context of device streams.

Streams are FIFO buffers of commands to execute relating to a given device.
Commands which enqueue tasks on a stream all return promptly and the command is
executed asynchronously. All side effects of a command on a stream are visible
to all subsequent commands on the same stream. Multiple streams may point to
the same device and those streams may be fed from multiple concurrent host-side
threads. Execution on multiple streams may be concurrent but isn't required to
be.
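
A minimal sketch of these semantics is shown below (the kernel, sizes, and
stream names are illustrative): commands in a single stream execute in FIFO
order, while commands in different streams have no implied ordering between
them.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float *data, float factor, int n)
   {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       if (idx < n)
           data[idx] *= factor;
   }

   int main()
   {
       constexpr int N = 1 << 20;
       float *d_a, *d_b;
       hipMalloc(&d_a, N * sizeof(*d_a));
       hipMalloc(&d_b, N * sizeof(*d_b));

       hipStream_t streamA, streamB;
       hipStreamCreate(&streamA);
       hipStreamCreate(&streamB);

       // Each launch returns promptly. Within streamA the two kernels run
       // in FIFO order, so the second launch sees the result of the first.
       scale<<<N / 256, 256, 0, streamA>>>(d_a, 2.0f, N);
       scale<<<N / 256, 256, 0, streamA>>>(d_a, 0.5f, N);

       // Work in streamB has no ordering relation to streamA and may run
       // concurrently with it, but is not required to.
       scale<<<N / 256, 256, 0, streamB>>>(d_b, 3.0f, N);

       // Block the host until each stream has drained.
       hipStreamSynchronize(streamA);
       hipStreamSynchronize(streamB);

       hipStreamDestroy(streamA);
       hipStreamDestroy(streamB);
       hipFree(d_a);
       hipFree(d_b);
       return 0;
   }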

Streams play a crucial role in managing the execution order of kernels and
memory operations on the GPU. By utilizing streams, developers can ensure
efficient execution of tasks, leading to improved performance and resource
utilization.

HIP Graphs
===============================================================================

HIP graphs offer an efficient alternative to the standard method of launching
GPU tasks via streams. Comprising nodes for operations and edges for
dependencies, HIP graphs reduce kernel launch overhead and provide a high-level
abstraction for managing dependencies and synchronization. By representing
sequences of kernels and memory operations as a single graph, they simplify
complex workflows and enhance performance, particularly for applications with
intricate dependencies and multiple execution stages.

For more details, see the :ref:`how_to_HIP_graph` documentation.

Example
-------------------------------------------------------------------------------

This example demonstrates the use of HIP Graphs to manage asynchronous
concurrent execution of two kernels. It builds a single graph containing two
independent kernel-plus-memcpy chains, instantiates it, and launches the
instantiated graph in two separate streams. Because the two chains share no
dependency edges between them, the runtime is free to execute them
concurrently, illustrating how the high-level abstraction of HIP Graphs
simplifies the workflow while preserving opportunities for overlap.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   __global__ void kernel(int *data, int value)
   {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       data[idx] = value;
   }

   int main()
   {
       constexpr int N = 1024;
       int *d_data1, *d_data2;
       int h_data1[N], h_data2[N];
       hipGraph_t graph;
       hipGraphExec_t graphExec;
       hipStream_t stream1, stream2;
       hipGraphNode_t kernelNode1, kernelNode2, memcpyNode1, memcpyNode2;
       hipKernelNodeParams kernelNodeParams1 = {0};
       hipKernelNodeParams kernelNodeParams2 = {0};
       hipMemcpy3DParms memcpyParams1 = {0};
       hipMemcpy3DParms memcpyParams2 = {0};

       // Allocate device memory
       hipMalloc(&d_data1, N * sizeof(*d_data1));
       hipMalloc(&d_data2, N * sizeof(*d_data2));

       // Create streams
       hipStreamCreate(&stream1);
       hipStreamCreate(&stream2);

       // Create an empty graph
       hipGraphCreate(&graph, 0);

       // Define kernel1 node parameters. The kernel arguments are passed by
       // address; the const_cast is needed because N is constexpr.
       void *kernelArgs1[] = {&d_data1, const_cast<int *>(&N)};
       kernelNodeParams1.func = reinterpret_cast<void *>(kernel);
       kernelNodeParams1.gridDim = dim3(N / 256);
       kernelNodeParams1.blockDim = dim3(256);
       kernelNodeParams1.sharedMemBytes = 0;
       kernelNodeParams1.kernelParams = kernelArgs1;
       kernelNodeParams1.extra = nullptr;

       // Define kernel2 node parameters
       void *kernelArgs2[] = {&d_data2, const_cast<int *>(&N)};
       kernelNodeParams2.func = reinterpret_cast<void *>(kernel);
       kernelNodeParams2.gridDim = dim3(N / 256);
       kernelNodeParams2.blockDim = dim3(256);
       kernelNodeParams2.sharedMemBytes = 0;
       kernelNodeParams2.kernelParams = kernelArgs2;
       kernelNodeParams2.extra = nullptr;

       // Add kernel nodes to the graph; with no dependencies between them,
       // the two kernels are independent of each other.
       hipGraphAddKernelNode(&kernelNode1, graph, nullptr, 0, &kernelNodeParams1);
       hipGraphAddKernelNode(&kernelNode2, graph, nullptr, 0, &kernelNodeParams2);

       // Define memcpy node parameters for the first device-to-host copy
       memcpyParams1.srcArray = nullptr;
       memcpyParams1.srcPos = make_hipPos(0, 0, 0);
       memcpyParams1.dstArray = nullptr;
       memcpyParams1.dstPos = make_hipPos(0, 0, 0);
       memcpyParams1.extent = make_hipExtent(N * sizeof(*d_data1), 1, 1);
       memcpyParams1.kind = hipMemcpyDeviceToHost;
       memcpyParams1.srcPtr = make_hipPitchedPtr(d_data1, N * sizeof(*d_data1), N, 1);
       memcpyParams1.dstPtr = make_hipPitchedPtr(h_data1, N * sizeof(*d_data1), N, 1);

       // Define memcpy node parameters for the second device-to-host copy
       memcpyParams2.srcArray = nullptr;
       memcpyParams2.srcPos = make_hipPos(0, 0, 0);
       memcpyParams2.dstArray = nullptr;
       memcpyParams2.dstPos = make_hipPos(0, 0, 0);
       memcpyParams2.extent = make_hipExtent(N * sizeof(*d_data2), 1, 1);
       memcpyParams2.kind = hipMemcpyDeviceToHost;
       memcpyParams2.srcPtr = make_hipPitchedPtr(d_data2, N * sizeof(*d_data2), N, 1);
       memcpyParams2.dstPtr = make_hipPitchedPtr(h_data2, N * sizeof(*d_data2), N, 1);

       // Add memcpy nodes, each depending on its corresponding kernel node
       hipGraphAddMemcpyNode(&memcpyNode1, graph, &kernelNode1, 1, &memcpyParams1);
       hipGraphAddMemcpyNode(&memcpyNode2, graph, &kernelNode2, 1, &memcpyParams2);

       // Instantiate the graph
       hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

       // Launch the instantiated graph in two different streams
       hipGraphLaunch(graphExec, stream1);
       hipGraphLaunch(graphExec, stream2);

       // Wait for all operations in both streams to complete
       hipStreamSynchronize(stream1);
       hipStreamSynchronize(stream2);

       // Cleanup
       hipGraphExecDestroy(graphExec);
       hipGraphDestroy(graph);
       hipStreamDestroy(stream1);
       hipStreamDestroy(stream2);
       hipFree(d_data1);
       hipFree(d_data2);

       std::cout << "Graph executed with asynchronous concurrent execution." << std::endl;
       return 0;
   }

Best practices and performance optimization
===============================================================================

Achieving optimal performance in GPU-accelerated applications involves adhering
to best practices and continuously tuning the code to ensure efficient resource
utilization.

Implementing best practices
-------------------------------------------------------------------------------

Following best practices for managing asynchronous operations is crucial for
achieving optimal performance. Here are some key strategies to consider:

- Minimize synchronization overhead: Synchronize only when necessary to avoid
  stalling the GPU and hindering parallelism.

- Leverage asynchronous operations: Use asynchronous memory transfers and
  kernel launches to overlap computation and data transfer, maximizing
  resource utilization (see the sketch after this list).

- Balance workloads: Distribute tasks efficiently between the host and device
  to ensure both are fully utilized. This can significantly enhance
  application responsiveness and performance.

- Utilize multiple streams: Create and manage multiple streams to run commands
  concurrently, reducing idle times and improving overall efficiency.
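
As a sketch of the overlap strategy above (buffer names and sizes are
illustrative, not prescriptive): host memory pinned with ``hipHostMalloc``
allows ``hipMemcpyAsync`` to return immediately, so a copy in one stream can
overlap a kernel running in another.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void compute(float *data, int n)
   {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       if (idx < n)
           data[idx] += 1.0f;
   }

   int main()
   {
       constexpr int N = 1 << 20;
       float *h_in, *d_in, *d_work;

       // Pinned host memory is required for hipMemcpyAsync to be truly
       // asynchronous with respect to the host.
       hipHostMalloc(&h_in, N * sizeof(*h_in));
       hipMalloc(&d_in, N * sizeof(*d_in));
       hipMalloc(&d_work, N * sizeof(*d_work));

       hipStream_t copyStream, computeStream;
       hipStreamCreate(&copyStream);
       hipStreamCreate(&computeStream);

       // The copy and the kernel touch different buffers and live in
       // different streams, so the hardware may overlap them.
       hipMemcpyAsync(d_in, h_in, N * sizeof(*h_in), hipMemcpyHostToDevice,
                      copyStream);
       compute<<<N / 256, 256, 0, computeStream>>>(d_work, N);

       hipStreamSynchronize(copyStream);
       hipStreamSynchronize(computeStream);

       hipStreamDestroy(copyStream);
       hipStreamDestroy(computeStream);
       hipHostFree(h_in);
       hipFree(d_in);
       hipFree(d_work);
       return 0;
   }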

Implementing these strategies is essential for effective asynchronous
operation management and for fully leveraging the capabilities of modern GPUs.

Balancing and profiling
-------------------------------------------------------------------------------

Profiling tools help identify bottlenecks by providing detailed insights into
the execution of GPU-accelerated applications. These tools allow developers to
visualize how computational tasks and memory operations are distributed across
different hardware resources. By analyzing these visualizations, developers can
pinpoint areas where the application may be spending excessive time, such as
during synchronization points or data transfers.

Key profiling metrics include:

- Kernel execution time: Measuring the time spent executing each kernel helps
  identify which kernels are taking longer than expected and may need
  optimization (see the timing sketch after this list).

- Memory transfer time: Assessing the duration of data transfers between the
  host and device can highlight inefficiencies or bottlenecks in memory
  operations.

- Stream utilization: Evaluating how streams are utilized can reveal whether
  resources are being used effectively or if some streams are underutilized.

- Concurrency: Analyzing the overlap of computation and data transfers helps
  identify opportunities to improve concurrency and reduce idle times.
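
As a minimal sketch of capturing the first of these metrics, kernel execution
time, with HIP events (the kernel and sizes are placeholders):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   __global__ void myKernel(float *data, int n)
   {
       int idx = threadIdx.x + blockIdx.x * blockDim.x;
       if (idx < n)
           data[idx] *= 2.0f;
   }

   int main()
   {
       constexpr int N = 1 << 20;
       float *d_data;
       hipMalloc(&d_data, N * sizeof(*d_data));

       hipEvent_t start, stop;
       hipEventCreate(&start);
       hipEventCreate(&stop);

       // Record events around the launch in the same (default) stream so
       // that the elapsed time brackets exactly this kernel.
       hipEventRecord(start, 0);
       myKernel<<<N / 256, 256>>>(d_data, N);
       hipEventRecord(stop, 0);
       hipEventSynchronize(stop);

       float ms = 0.0f;
       hipEventElapsedTime(&ms, start, stop);
       std::cout << "Kernel time: " << ms << " ms" << std::endl;

       hipEventDestroy(start);
       hipEventDestroy(stop);
       hipFree(d_data);
       return 0;
   }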

Using profiling tools, developers gain a comprehensive understanding of their
application's performance characteristics and can make informed decisions
about where to focus optimization efforts. Regular profiling and adjustments
ensure that applications run at their best, maintaining high efficiency and
performance.
