diff --git a/docs/how-to/hip_runtime_api/asynchronous.rst b/docs/how-to/hip_runtime_api/asynchronous.rst
index 62791361de..b819032501 100644
--- a/docs/how-to/hip_runtime_api/asynchronous.rst
+++ b/docs/how-to/hip_runtime_api/asynchronous.rst
@@ -8,19 +8,57 @@
 Asynchronous concurrent execution
 *******************************************************************************
 
-Asynchronous concurrent execution empowers developers to achieve efficient
-parallelism and resource utilization. By understanding and implementing key
-concepts and best practices, significant performance improvements are within
-reach. Techniques such as overlapping computation and data transfer, managing
-concurrent kernel execution, and utilizing graphs offer a robust framework for
-optimizing GPU performance. As GPU technology evolves, the principles of
-asynchronous execution remain critical for achieving high throughput and low
-latency. Developers are encouraged to explore and experiment with these
-techniques to fully harness their potential.
+Asynchronous concurrent execution is important for efficient parallelism and
+resource utilization, achieved through techniques such as overlapping
+computation and data transfer, managing concurrent kernel execution with
+streams on single or multiple devices, and using HIP graphs.
 
 Streams and concurrent execution
 ===============================================================================
 
+All asynchronous APIs, such as kernel execution, data movement and potentially
+data allocation or freeing, run in the context of device streams.
+
+Streams are FIFO buffers of commands to execute on a given device. Commands
+that enqueue tasks on a stream return promptly, and the commands are executed
+asynchronously. All side effects of a command on a stream are visible to all
+subsequent commands on the same stream. Multiple streams may point to the same
+device, and those streams may be fed from multiple concurrent host-side
+threads. Execution on multiple streams may be concurrent, but it is not
+required to be.
+
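+As a minimal sketch of these semantics, assuming device buffers ``d_a`` and
+``d_b``, pinned host buffers ``h_a`` and ``h_b`` of ``n`` floats each, and
+the hypothetical kernel below, the host enqueues all of the following work
+without blocking and only waits at the end:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void scale(float *data, float factor, size_t n)
+   {
+       size_t idx = threadIdx.x + blockIdx.x * blockDim.x;
+       if (idx < n)
+           data[idx] *= factor;
+   }
+
+   void enqueue_work(float *d_a, float *d_b, const float *h_a,
+                     const float *h_b, size_t n,
+                     hipStream_t s0, hipStream_t s1)
+   {
+       // Each call below returns promptly; FIFO order within a stream
+       // guarantees that each kernel sees its completed input copy.
+       hipMemcpyAsync(d_a, h_a, n * sizeof(float), hipMemcpyHostToDevice, s0);
+       scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, 2.0f, n);
+       hipMemcpyAsync(d_b, h_b, n * sizeof(float), hipMemcpyHostToDevice, s1);
+       scale<<<(n + 255) / 256, 256, 0, s1>>>(d_b, 3.0f, n);
+
+       // The two streams may run concurrently, but are not required to.
+       hipStreamSynchronize(s0);
+       hipStreamSynchronize(s1);
+   }
+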
 Streams play a crucial role in managing the execution order of kernels and
 memory operations on the GPU. By utilizing streams, developers can ensure
 efficient execution of tasks, leading to improved performance and resource
@@ -423,196 +461,42 @@ Example
 HIP Graphs
 ===============================================================================
 
-HIP Graphs provide a way to represent complex workflows as a series of
-interconnected tasks. By creating and managing graphs, developers can optimize
-dependent task execution. Graphs reduce the overhead associated with launching
-individual kernels and memory operations, providing a high-level abstraction
-for managing dependencies and synchronizing tasks. Examples include
-representing a sequence of kernels and memory operations as a single graph.
-Using graphs enhances performance and simplifies complex workflow management.
-This technique is particularly useful for applications with intricate
-dependencies and multiple execution stages.
+HIP graphs offer an efficient alternative to the standard method of launching
+GPU tasks via streams. Comprising nodes for operations and edges for
+dependencies, HIP graphs reduce kernel launch overhead and provide a high-level
+abstraction for managing dependencies and synchronization. By representing
+sequences of kernels and memory operations as a single graph, they simplify
+complex workflows and enhance performance, particularly for applications with
+intricate dependencies and multiple execution stages.
 
 For more details, see the :ref:`how_to_HIP_graph` documentation.
-
-Example
--------------------------------------------------------------------------------
-
-This example demonstrates the use of HIP Graphs to manage asynchronous
-concurrent execution of two kernels. It creates a graph with nodes for the
-kernel executions and memory copies, which are then instantiated and launched
-in two separate streams. This setup ensures efficient and concurrent execution,
-leveraging the high-level abstraction of HIP Graphs to simplify the workflow
-and improve performance.
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-   #include <iostream>
-
-   __global__ void kernel(int *data, int value)
-   {
-       int idx = threadIdx.x + blockIdx.x * blockDim.x;
-       data[idx] = value;
-   }
-
-   int main()
-   {
-       constexpr int N = 1024;
-
-       int *d_data1, *d_data2;
-       int h_data1[N], h_data2[N];
-
-       hipGraph_t graph;
-       hipGraphExec_t graphExec;
-       hipStream_t stream1, stream2;
-       hipGraphNode_t kernelNode1, kernelNode2, memcpyNode1, memcpyNode2;
-       hipKernelNodeParams kernelNodeParams1 = {0};
-       hipKernelNodeParams kernelNodeParams2 = {0};
-       hipMemcpy3DParms memcpyParams1 = {0};
-       hipMemcpy3DParms memcpyParams2 = {0};
-
-       // Allocate device memory
-       hipMalloc(&d_data1, N * sizeof(*d_data1));
-       hipMalloc(&d_data2, N * sizeof(*d_data2));
-
-       // Create streams
-       hipStreamCreate(&stream1);
-       hipStreamCreate(&stream2);
-
-       // Create an empty graph
-       hipGraphCreate(&graph, 0);
-
-       // Define kernel1 node parameters
-       void *kernelArgs1[] = {&d_data1, &N};
-       kernelNodeParams1.func = reinterpret_cast<void *>(kernel);
-       kernelNodeParams1.gridDim = dim3(N / 256);
-       kernelNodeParams1.blockDim = dim3(256);
-       kernelNodeParams1.sharedMemBytes = 0;
-       kernelNodeParams1.kernelParams = kernelArgs1;
-       kernelNodeParams1.extra = nullptr;
-
-       // Define kernel2 node parameters
-       void *kernelArgs2[] = {&d_data2, &N};
-       kernelNodeParams2.func = reinterpret_cast<void *>(kernel);
-       kernelNodeParams2.gridDim = dim3(N / 256);
-       kernelNodeParams2.blockDim = dim3(256);
-       kernelNodeParams2.sharedMemBytes = 0;
-       kernelNodeParams2.kernelParams = kernelArgs2;
-       kernelNodeParams2.extra = nullptr;
-
-       // Add kernel nodes to graph
-       hipGraphAddKernelNode(&kernelNode1, graph, nullptr, 0, &kernelNodeParams1);
-       hipGraphAddKernelNode(&kernelNode2, graph, nullptr, 0, &kernelNodeParams2);
-
-       // Define memcpy node parameters for stream1
-       memcpyParams1.srcArray = nullptr;
-       memcpyParams1.srcPos = make_hipPos(0, 0, 0);
-       memcpyParams1.dstArray = nullptr;
-       memcpyParams1.dstPos = make_hipPos(0, 0, 0);
-       memcpyParams1.extent = make_hipExtent(N * sizeof(*d_data1), 1, 1);
-       memcpyParams1.kind = hipMemcpyDeviceToHost;
-       memcpyParams1.srcPtr = make_hipPitchedPtr(d_data1, N * sizeof(*d_data1), N, 1);
-       memcpyParams1.dstPtr = make_hipPitchedPtr(h_data1, N * sizeof(*d_data1), N, 1);
-
-       // Define memcpy node parameters for stream2
-       memcpyParams2.srcArray = nullptr;
-       memcpyParams2.srcPos = make_hipPos(0, 0, 0);
-       memcpyParams2.dstArray = nullptr;
-       memcpyParams2.dstPos = make_hipPos(0, 0, 0);
-       memcpyParams2.extent = make_hipExtent(N * sizeof(*d_data2), 1, 1);
-       memcpyParams2.kind = hipMemcpyDeviceToHost;
-       memcpyParams2.srcPtr = make_hipPitchedPtr(d_data2, N * sizeof(*d_data2), N, 1);
-       memcpyParams2.dstPtr = make_hipPitchedPtr(h_data2, N * sizeof(*d_data2), N, 1);
-
-       // Add memcpy nodes to graph
-       hipGraphAddMemcpyNode(&memcpyNode1, graph, &kernelNode1, 1, &memcpyParams1);
-       hipGraphAddMemcpyNode(&memcpyNode2, graph, &kernelNode2, 1, &memcpyParams2);
-
-       // Instantiate the graph
-       hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
-
-       // Launch the graph asynchronously in different streams
-       hipGraphLaunch(graphExec, stream1);
-       hipGraphLaunch(graphExec, stream2);
-
-       // Wait for all operations in both streams to complete
-       hipStreamSynchronize(stream1);
-       hipStreamSynchronize(stream2);
-
-       // Cleanup
-       hipGraphExecDestroy(graphExec);
-       hipGraphDestroy(graph);
-       hipStreamDestroy(stream1);
-       hipStreamDestroy(stream2);
-       hipFree(d_data1);
-       hipFree(d_data2);
-
-       std::cout << "Graph executed with asynchronous concurrent execution." << std::endl;
-
-       return 0;
-   }
-
-Best practices and performance optimization
-===============================================================================
-
-Achieving optimal performance in GPU-accelerated applications involves adhering
-to best practices and continuously tuning the code to ensure efficient resource
-utilization.
-
-Implementing best practices
--------------------------------------------------------------------------------
-
-Following best practices for managing asynchronous operations is crucial for
-achieving optimal performance. Here are some key strategies to consider:
-
-- minimize synchronization overhead: Synchronize only when necessary to avoid
-  stalling the GPU and hindering parallelism.
-
-- leverage asynchronous operations: Use asynchronous memory transfers and
-  kernel launches to overlap computation and data transfer, maximizing resource
-  utilization.
-
-- balance workloads: Distribute tasks efficiently between the host and device
-  to ensure both are fully utilized. This can significantly enhance application
-  responsiveness and performance.
-
-- utilize multiple streams: Create and manage multiple streams to run commands
-  concurrently, reducing idle times and improving overall efficiency.
-
-By implementing these strategies, developers can significantly enhance
-application responsiveness and overall performance. These best practices are
-essential for effective asynchronous operation management and for fully
-leveraging the capabilities of modern GPUs.
-
-Balancing and profiling
--------------------------------------------------------------------------------
-
-Profiling tools help identify bottlenecks by providing detailed insights into
-the execution of GPU-accelerated applications. These tools allow developers to
-visualize how computational tasks and memory operations are distributed across
-different hardware resources. By analyzing these visualizations, developers can
-pinpoint areas where the application may be spending excessive time, such as
-during synchronization points or data transfers.
-
-Key profiling metrics include:
-
-- kernel execution time: Measuring the time spent executing each kernel helps
-  identify which kernels are taking longer than expected and may need
-  optimization.
-
-- memory transfer time: Assessing the duration of data transfers between the
-  host and device can highlight inefficiencies or bottlenecks in memory
-  operations.
-
-- stream utilization: Evaluating how streams are utilized can reveal whether
-  resources are being used effectively or if some streams are underutilized.
-
-- concurrency: Analyzing the overlap of computation and data transfers helps
-  identify opportunities to improve concurrency and reduce idle times.
- -Using profiling tools, developers gain a comprehensive understanding of their -application's performance characteristics, making informed decisions about -where to focus optimization efforts. Regular profiling and adjustments ensure -that applications run at their best, maintaining high efficiency and -performance. \ No newline at end of file
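+
+As a short sketch of the pattern, using stream capture as one way to build a
+graph (and assuming an existing ``stream``, device buffer ``d_data``, host
+buffer ``h_out`` of ``bytes`` bytes, launch configuration ``blocks`` and
+``threads``, and hypothetical kernels ``step1`` and ``step2``), a sequence of
+stream work can be recorded once and then relaunched cheaply:
+
+.. code-block:: cpp
+
+   hipGraph_t graph;
+   hipGraphExec_t graphExec;
+
+   // Capture the stream's work into a graph instead of executing it.
+   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
+   step1<<<blocks, threads, 0, stream>>>(d_data);
+   step2<<<blocks, threads, 0, stream>>>(d_data);
+   hipMemcpyAsync(h_out, d_data, bytes, hipMemcpyDeviceToHost, stream);
+   hipStreamEndCapture(stream, &graph);
+
+   hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
+
+   // Relaunching the instantiated graph repeats the whole captured
+   // sequence with a single launch call per iteration.
+   for (int i = 0; i < 100; ++i) {
+       hipGraphLaunch(graphExec, stream);
+   }
+   hipStreamSynchronize(stream);
+
+   hipGraphExecDestroy(graphExec);
+   hipGraphDestroy(graph);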