diff --git a/docs/conf.py b/docs/conf.py
index 4ac5cc3208..82bcefee89 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -50,6 +50,5 @@ exclude_patterns = [
     "doxygen/mainpage.md",
-    "understand/glossary.md",
-    "understand/thread_hierarchy_coop_figure.rst"
+    "understand/glossary.md"
 ]
\ No newline at end of file
diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile
index aa71441160..b0e7231aa6 100644
--- a/docs/doxygen/Doxyfile
+++ b/docs/doxygen/Doxyfile
@@ -834,7 +834,6 @@ INPUT = mainpage.md \
          ../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \
          ../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \
          ../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \
-         ../../../llvm-project/clang/lib/Headers/__clang_hip_math.h \
          ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h
 
 # This tag can be used to specify the character encoding of the source files
diff --git a/docs/how-to/cooperative_groups.rst b/docs/how-to/cooperative_groups.rst
index 00290bf6ca..e70c9d718b 100644
--- a/docs/how-to/cooperative_groups.rst
+++ b/docs/how-to/cooperative_groups.rst
@@ -8,9 +8,7 @@
 Cooperative Groups
 *******************************************************************************
 
-Cooperative groups API is an extension to the HIP programming model, which provides developers with a flexible, dynamic grouping mechanism for the communicating threads. The API enables the developers to specify the level of granularity for thread communication which can lead to more efficient parallel decompositions.
-
-.. Maybe this sentence is better: The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than those defined by the hardware.
+The cooperative groups API is an extension to the HIP programming model that provides developers with a flexible, dynamic grouping mechanism for communicating threads. Cooperative groups let you define your own sets of thread groups, which may fit your use cases better than those defined by the hardware. This lets you specify the level of granularity for thread communication, which can lead to more efficient parallel decompositions.
 
 The API is accessible in the ``cooperative_groups`` namespace after the ``hip_cooperative_groups.h`` is included. The header contains the following elements:
@@ -37,11 +35,13 @@ The thread hierarchy abstraction of cooperative groups is in
 Cooperative group thread hierarchy.
 
+The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (deprecated since 5.0). The **grid** in cooperative groups is a single dispatch of a kernel for execution, like the original grid.
+
 .. note::
 
-    This feature introduced a new level between thread and thread block in the programming model.
+    The ability to synchronize over a grid or multi grid requires the kernel to be launched using the cooperative launch API.
 
-The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (Deprecated since 5.0). The **grid** in cooperative groups is a single dispatch of kernels for execution like the original grid. The ability to synchronize over a grid requires the kernel to be launched using the cooperative groups API. The **block** is the same as the :ref:`inherent_thread_model` block entity.
+The **block** is the same as the :ref:`inherent_thread_model` block entity.
 
 .. note::
 
@@ -92,7 +92,7 @@ are public of the ``grid_group`` class.
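+For example, a kernel can create a barrier across every thread of the dispatch through the ``grid_group`` handle. The following minimal sketch uses illustrative names; it is only valid when the kernel is launched through the cooperative launch API, such as ``hipLaunchCooperativeKernel()``:
+
+.. code-block:: cpp
+
+  __global__ void grid_sync_example(float* data)
+  {
+      cooperative_groups::grid_group grid = cooperative_groups::this_grid();
+      // Phase 1: every thread of the dispatch writes its own slot.
+      data[grid.thread_rank()] = 1.0f;
+      // Wait until all threads of all blocks have finished phase 1.
+      grid.sync();
+      // Phase 2 can now safely read values written by other blocks.
+  }
+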
 For further details, check the :ref:`grid_group references <grid_group_ref>`.
 
 Multi-grid group
 ------------------
 
-Represents an inter-device cooperative groups type where the participating threads within the group span multiple devices that run the same kernel on the devices. All the multi-grid group APIs require that you have used the appropriate launch API.
+Represents an inter-device cooperative groups type where the participating threads within the group span multiple devices running the same kernel. Use the cooperative launch API to enable synchronization across the multi-grid group.
 
 .. code-block:: cpp
 
@@ -130,15 +130,15 @@ Constructed via:
 
 .. note::
 
-    * ``Size`` must be a power of 2 and not larger than warp (wavefront) size.
-    * ``shfl`` functions support integer or float type.
+    * Size must be a power of 2 and not larger than warp (wavefront) size.
+    * ``shfl()`` functions support integer or float types.
 
-The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, ``shfl_up(...)`` and ``shfl_xor(...)`` member functions are public of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>` .
+The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl()``, ``shfl_down()``, ``shfl_up()``, ``shfl_xor()``, ``ballot()``, ``any()``, ``all()``, ``match_any()`` and ``match_all()`` member functions are public of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>`.
 
 Coalesced groups
 ------------------
 
-Threads (64 threads on CDNA and 32 threads on RDNA or NVIDIA GPUs) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads referred as coalesced and coalesced groups represents an active thread group within a warp.
+Threads (64 threads on CDNA and 32 threads on RDNA or NVIDIA GPUs) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and a coalesced group represents an active thread group within a warp.
 
 This group type also supports sub-wave level intrinsics.
 
@@ -154,9 +154,9 @@ Constructed via:
 
 .. note::
 
-    * ``shfl`` functions support integer or float type.
+    ``shfl()`` functions support integer or float types.
 
-The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, and ``shfl_up(...)`` member functions are public of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>` .
+The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl()``, ``shfl_down()``, ``shfl_up()``, ``ballot()``, ``any()``, ``all()``, ``match_any()`` and ``match_all()`` member functions are public of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>`.
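+For example, coalesced groups can aggregate an atomic operation over the threads that are currently active in a branch. The following is a minimal sketch with an illustrative function name:
+
+.. code-block:: cpp
+
+  __device__ int aggregated_atomic_inc(int* counter)
+  {
+      // Group only the threads that are active at this point of the branch.
+      cooperative_groups::coalesced_group active = cooperative_groups::coalesced_threads();
+      int prev = 0;
+      // Rank 0 performs a single atomic addition on behalf of the whole group.
+      if (active.thread_rank() == 0)
+      {
+          prev = atomicAdd(counter, active.size());
+      }
+      // Broadcast the pre-update value from rank 0, then offset it by each thread's rank.
+      prev = active.thread_rank() + active.shfl(prev, 0);
+      return prev;
+  }
+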
 Cooperative groups simple example
 =================================
 
diff --git a/docs/reference/cooperative_groups_reference.rst b/docs/reference/cooperative_groups_reference.rst
index f044b69219..52ca19ab6a 100644
--- a/docs/reference/cooperative_groups_reference.rst
+++ b/docs/reference/cooperative_groups_reference.rst
@@ -1,6 +1,5 @@
 .. meta::
-  :description: This chapter lists types and device API wrappers related to the Cooperative Group
-                feature. Programmers can directly use them in their kernels to make use of this feature.
+  :description: This chapter lists types and device API wrappers related to the Cooperative Group feature. Programmers can directly use these API features in their kernels.
   :keywords: AMD, ROCm, HIP, cooperative groups
 
 .. _cooperative_groups_reference:
 
@@ -12,7 +11,7 @@ HIP Cooperative Groups API
 Cooperative kernel launches
 ===========================
 
-The following host side functions used for cooperative kernel launches.
+The following host-side functions are used for cooperative kernel launches.
 
 .. doxygenfunction:: hipLaunchCooperativeKernel(const void* f, dim3 gridDim, dim3 blockDimX, void** kernelParams, unsigned int sharedMemBytes, hipStream_t stream)
 
@@ -48,7 +47,7 @@ The following cooperative groups classes can be used on the device side.
 .. doxygenclass:: cooperative_groups::multi_grid_group
    :members:
-    
+
 .. _thread_block_tile_ref:
 
 .. doxygenclass:: cooperative_groups::thread_block_tile
    :members:
 
@@ -62,7 +61,7 @@
 Cooperative groups construct functions
 ======================================
 
-The following functions are used to construct different group types instances on the device side.
+The following functions are used to construct different group-type instances on the device side.
 
 .. doxygenfunction:: cooperative_groups::this_multi_grid
 
@@ -83,7 +82,7 @@ The following functions are used to construct different group types instances on
 Cooperative groups exposed API functions
 ========================================
 
-The following functions are the exposed API for different group types instances on the device side.
+The following functions are the exposed API for different group-type instances on the device side.
 
 .. doxygenfunction:: cooperative_groups::group_size
 
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index d7a9278bc2..7080659f2d 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -45,7 +45,6 @@ subtrees:
   - file: reference/cooperative_groups_reference
     title: HIP Cooperative Groups API
   - file: reference/virtual_rocr
-    title: HSA Runtime API for ROCm
   - file: reference/unified_memory_reference
     title: HIP Managed Memory Allocation API
   - file: reference/deprecated_api_list
diff --git a/docs/tutorial/cooperative_groups_tutorial.rst b/docs/tutorial/cooperative_groups_tutorial.rst
index 0fd018b546..270bedae75 100644
--- a/docs/tutorial/cooperative_groups_tutorial.rst
+++ b/docs/tutorial/cooperative_groups_tutorial.rst
@@ -46,10 +46,10 @@ You can use tiled partition to calculate the sum of ``partition_size`` length se
     return result;
 }
 
-Device side code
+Device-side code
 ----------------
 
-To calculate the sum of the sets of numbers, the tutorial uses the shared memory-based reduction on the device side. The warp level intrinsics usage is not covered in this tutorial, unlike in the :doc:`reduction tutorial. ` The x input variable is a shared pointer, which needs to be synchronized after every value changes. The ``thread_group`` input parameter can be ``thread_block_tile`` or ``thread_block`` because the ``thread_group`` is the parent class of these types. The ``val`` are the numbers to calculate the sum of. The returned results of this function return the final results of the reduction on thread ID 0 of the ``thread_group``, and for every other thread, the function results are 0.
+To calculate the sum of the sets of numbers, the tutorial uses the shared memory-based reduction on the device side. The warp-level intrinsics usage is not covered in this tutorial, unlike in the :doc:`reduction tutorial`. The ``x`` input variable is a shared pointer, which needs to be synchronized after every value change. The ``thread_group`` input parameter can be a ``thread_block_tile`` or a ``thread_block`` because ``thread_group`` is the parent class of these types. The ``val`` parameter holds the numbers to sum. This function returns the final result of the reduction on thread ID 0 of the ``thread_group``; for every other thread, the function result is 0.
 
 .. code-block:: cuda
 
@@ -139,7 +139,7 @@ In this code section, the sum is calculated on ``thread_block_group`` level, the
 
 3. The reduction of custom partition
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In this code section, the sum is calculated on the custom partition level, then the results are stored in global memory. The custom partition is a partial block of the thread block, it means the reduction calculate on a shorter sequence of input numbers than at the ``thread_block_group`` case.
+In this code section, the sum is calculated on the custom partition level, then the results are stored in global memory. The custom partition is a partial block of the thread block, which means the reduction is calculated on a shorter sequence of input numbers than in the ``thread_block_group`` case.
 
 .. code-block:: cuda
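+    // A sketch of the custom partition reduction. The names reduce_sum,
+    // thread_block_group, x, val, and partition_size are assumed from the
+    // tutorial's prose, not taken verbatim from the full sample.
+    cooperative_groups::thread_group custom_partition =
+        cooperative_groups::tiled_partition(thread_block_group, partition_size);
+    // Each custom partition reduces its own partition_size-long slice of values.
+    double partial_result = reduce_sum(custom_partition, x, val);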