From 6a92313c3a432aba28d3e7818bcdf1a50e1f6be9 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Tue, 23 Jul 2024 14:39:02 +0200 Subject: [PATCH] PR feedbacks WIP WIP WIP WIP --- .../thread_hierarchy_coop.drawio | 90 +++++++++---------- .../thread_hierarchy_coop.svg | 1 + .../thread_hierarchy_coop.svg | 1 - docs/how-to/cooperative_groups.rst | 51 ++++++++--- .../cooperative_groups_reference.rst | 4 +- docs/reference/cpp_language_extensions.rst | 12 ++- docs/sphinx/_toc.yml.in | 2 +- .../programming_model_reference.rst | 33 +------ .../thread_hierarchy_coop_figure.rst | 8 -- 9 files changed, 101 insertions(+), 101 deletions(-) rename docs/data/{understand/programming_model_reference => how-to/cooperative_groups}/thread_hierarchy_coop.drawio (98%) create mode 100644 docs/data/how-to/cooperative_groups/thread_hierarchy_coop.svg delete mode 100644 docs/data/understand/programming_model_reference/thread_hierarchy_coop.svg delete mode 100644 docs/understand/thread_hierarchy_coop_figure.rst diff --git a/docs/data/understand/programming_model_reference/thread_hierarchy_coop.drawio b/docs/data/how-to/cooperative_groups/thread_hierarchy_coop.drawio similarity index 98% rename from docs/data/understand/programming_model_reference/thread_hierarchy_coop.drawio rename to docs/data/how-to/cooperative_groups/thread_hierarchy_coop.drawio index fb4c19fef9..e4c0c90d2d 100644 --- a/docs/data/understand/programming_model_reference/thread_hierarchy_coop.drawio +++ b/docs/data/how-to/cooperative_groups/thread_hierarchy_coop.drawio @@ -1,6 +1,6 @@ - + @@ -1411,7 +1411,7 @@ - + @@ -1591,7 +1591,7 @@ - + @@ -1762,7 +1762,7 @@ - + @@ -1876,7 +1876,7 @@ - + @@ -2047,7 +2047,7 @@ - + @@ -3490,7 +3490,7 @@ - + @@ -3670,7 +3670,7 @@ - + @@ -3841,7 +3841,7 @@ - + @@ -3955,7 +3955,7 @@ - + @@ -4126,7 +4126,7 @@ - + @@ -4534,7 +4534,7 @@ - + @@ -4600,7 +4600,7 @@ - + @@ -4771,7 +4771,7 @@ - + @@ -4933,7 +4933,7 @@ - + @@ -4984,163 +4984,163 @@ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + diff --git a/docs/data/how-to/cooperative_groups/thread_hierarchy_coop.svg b/docs/data/how-to/cooperative_groups/thread_hierarchy_coop.svg new file mode 100644 index 0000000000..199f92306a --- /dev/null +++ b/docs/data/how-to/cooperative_groups/thread_hierarchy_coop.svg @@ -0,0 +1 @@ +Grid
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
GridMulti Grid
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/data/understand/programming_model_reference/thread_hierarchy_coop.svg b/docs/data/understand/programming_model_reference/thread_hierarchy_coop.svg deleted file mode 100644 index a3f57994fb..0000000000 --- a/docs/data/understand/programming_model_reference/thread_hierarchy_coop.svg +++ /dev/null @@ -1 +0,0 @@ -Grid
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Cluster
Cluster
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Block
Block
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Cluster
Cluster
Warp
Warp
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
Warp
GridMulti Grid
Cluster
Cluster
Warp
Warp
Block
Block
Warp
Warp
Warp
Warp
Block
Block
Block
Block
Block
Block
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/how-to/cooperative_groups.rst b/docs/how-to/cooperative_groups.rst index 95e4082702..78cc0fcc4c 100644 --- a/docs/how-to/cooperative_groups.rst +++ b/docs/how-to/cooperative_groups.rst @@ -2,11 +2,15 @@ :description: This topic describes how to use cooperative groups in HIP :keywords: AMD, ROCm, HIP, cooperative groups +.. _cooperative_groups_how-to: + ******************************************************************************* Cooperative Groups ******************************************************************************* -Cooperative groups API is an extension to the ROCm programming model. It provides developers with a flexible grouping mechanism. This feature was introduced with ROCm 4.1 at AMD platform and CUDA 9.0 at NVIDIA platform. +Cooperative groups API is an extension to the HIP programming model, which provides developers with a flexible, dynamic grouping mechanism for the communicating threads. The API enables the developers to specify the level of granularity for thread communication which can lead to more efficient parallel decompositions. + +.. Maybe this sentence is better: The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than those defined by the hardware. The API is accessible in the ``cooperative_groups`` namespace after the ``hip_cooperative_groups.h`` is included. The header contains the following elements: @@ -16,19 +20,37 @@ The API is accessible in the ``cooperative_groups`` namespace after the ``hip_c * Hardware-accelerated operations over the whole group, like shuffles. * Static functions to create groups and subgroups. +.. note:: + + This feature was introduced with ROCm 4.1 at AMD platform and CUDA 9.0 at NVIDIA platform. This introduced a new level between block and thread block level for synchronization. + Cooperative groups thread model =============================== The thread hierarchy abstraction of cooperative groups is in -:numref:`coop_thread_hierarchy_how_to`. +:numref:`coop_thread_hierarchy`. + +.. _coop_thread_hierarchy: -.. _coop_thread_hierarchy_how_to: +.. figure:: ../data/how-to/cooperative_groups/thread_hierarchy_coop.svg + :alt: Diagram depicting nested rectangles of varying color. The outermost one + titled "Grid", inside sets of different sized rectangles layered on + one another titled "Block". Each "Block" containing sets of uniform + rectangles layered on one another titled "Warp". Each of the "Warp" + titled rectangles filled with downward pointing arrows inside. -.. include:: ../understand/thread_hierarchy_coop_figure.rst + Cooperative group thread hierarchy. The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (Deprecated since 5.0). The **grid** in cooperative groups is a single dispatch of kernels for execution like the original grid. The ability to synchronize over a grid requires the kernel to be launched using the cooperative groups API. The **block** is the same as the :ref:`inherent_thread_model` block entity. -For further information, check the :ref:`inherent thread model `. For details on memory model, check the :ref:`memory model description ` +.. note:: + + Explicit warp-level thread handling is absent from the Cooperative Groups API. + In order to exploit the known hardware SIMD width on which built-in + functionality translates to simpler logic, one may use the group partitioning + part of the API, such as ``tiled_partition``. + +For details on memory model, check the :ref:`memory model description `. Cooperative groups simple example ================================= @@ -98,7 +120,7 @@ The difference to the original block model in the ``reduce_sum`` device function // ... } -The ``reduce_sum`` function call and input data initialization difference to the original block model is the following. +The ``reduce_sum()`` function call and input data initialization difference to the original block model is the following. .. tab-set:: .. tab-item:: Original Block @@ -144,7 +166,6 @@ The ``reduce_sum`` function call and input data initialization difference to the } At the device function, the input group type is the ``thread_group``, which is the parent class of all the cooperative groups type. With this, you can write generic functions, which can work with any type of cooperative groups. -The kernel launch also differs with cooperative groups, as it depends on the group type. For example, grid groups with a single-GPU, the ``hipLaunchCooperativeKernel`` has to be used. Group types =========== @@ -154,7 +175,7 @@ Group types are based on the levels of synchronization and data sharing among th Thread-block group ------------------ -Represents an intra-workgroup cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing ``workgroup``. +Represents an intra-block cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing ``block``. .. code-block:: cpp @@ -171,7 +192,7 @@ The ``group_index()`` , ``thread_index()`` , ``thread_rank()`` , ``size()``, ``c Grid group ------------ -Represents an inter-workgroup cooperative groups type where the group's participating threads span multiple workgroups running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid. +Represents an inter-block cooperative groups type where the group's participating threads span multiple blocks running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid. .. code-block:: cpp @@ -227,7 +248,7 @@ Constructed via: .. note:: - * ``Size`` must be a power of 2 and not larger than wavefront size. + * ``Size`` must be a power of 2 and not larger than warp (wavefront) size. * ``shfl`` functions support integer or float type. The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, ``shfl_up(...)`` and ``shfl_xor(...)`` member functions are public of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references ` . @@ -235,8 +256,9 @@ The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ` Coalesced groups ------------------ -Represents an active thread group in a wavefront. This group type also supports sub-wave level -intrinsics. +Threads (64 threads on CDNA and 32 threads on RDNA or NVIDIA GPUs) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads referred as coalesced and coalesced groups represents an active thread group within a warp. + +This group type also supports sub-wave level intrinsics. .. code-block:: cpp @@ -254,11 +276,12 @@ Constructed via: The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, and ``shfl_up(...)`` member functions are public of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references ` . +.. _coop_synchronization: + Synchronization =============== -With each group type, the synchronization requires using the correct cooperative groups -launch API. +With each group type, the synchronization requires using the correct cooperative groups launch API. **Check the kernel launch capability** diff --git a/docs/reference/cooperative_groups_reference.rst b/docs/reference/cooperative_groups_reference.rst index d02dd2ae95..f044b69219 100644 --- a/docs/reference/cooperative_groups_reference.rst +++ b/docs/reference/cooperative_groups_reference.rst @@ -3,8 +3,10 @@ feature. Programmers can directly use them in their kernels to make use of this feature. :keywords: AMD, ROCm, HIP, cooperative groups +.. _cooperative_groups_reference: + ******************************************************************************* -Cooperative Groups API +HIP Cooperative Groups API ******************************************************************************* Cooperative kernel launches diff --git a/docs/reference/cpp_language_extensions.rst b/docs/reference/cpp_language_extensions.rst index cd0527e58a..e44031875c 100644 --- a/docs/reference/cpp_language_extensions.rst +++ b/docs/reference/cpp_language_extensions.rst @@ -292,8 +292,7 @@ dimensions to 1. Memory fence instructions ==================================================== -HIP supports ``__threadfence()`` and ``__threadfence_block()``. If you're using ``threadfence_system()`` in -the HIP-Clang path, you can use the following workaround: +HIP supports ``__threadfence()`` and ``__threadfence_block()``. If you're using ``threadfence_system()`` in the HIP-Clang path, you can use the following workaround: #. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled. #. Modify kernels that use ``__threadfence_system()`` as follows: @@ -306,9 +305,16 @@ the HIP-Clang path, you can use the following workaround: Synchronization functions ==================================================== + +There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards, when threads in the group access the same addresses in shared or global memory. The data hazards can be avoided with synchronization of the group. + +Synchronization functions causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group. + The ``__syncthreads()`` built-in function is supported in HIP. The ``__syncthreads_count(int)``, ``__syncthreads_and(int)``, and ``__syncthreads_or(int)`` functions are under development. +The Cooperative Groups API offer options to do synchronization on a developer defined set of thread groups. For further information, check :ref:`Cooperative Groups API ` or :ref:`Cooperative Groups how to `. + Math functions ==================================================== @@ -2080,6 +2086,8 @@ HIP supports the following kernel language cooperative groups types and function - ✓ - ✓ +For further information, check :ref:`Cooperative Groups API ` or :ref:`Cooperative Groups how to `. + Warp matrix functions ============================================================ diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 541b2eaf35..1faed55e12 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -42,7 +42,7 @@ subtrees: - file: reference/terms title: Comparing Syntax for different APIs - file: reference/cooperative_groups_reference - title: Cooperative Groups API + title: HIP Cooperative Groups API - file: reference/virtual_rocr title: HSA Runtime API for ROCm - file: reference/unified_memory_reference diff --git a/docs/understand/programming_model_reference.rst b/docs/understand/programming_model_reference.rst index 8f386b69d8..439a7fa92f 100644 --- a/docs/understand/programming_model_reference.rst +++ b/docs/understand/programming_model_reference.rst @@ -89,38 +89,13 @@ model. The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than -those defined by the hardware. The set of implicit groups by kernel launch -parameters are still available. - -The thread hierarchy abstraction of Cooperative Groups manifest as depicted in -:numref:`coop_thread_hierarchy`. - -.. _coop_thread_hierarchy: - -.. include:: thread_hierarchy_coop_figure.rst - -Multi Grid - An abstraction of potentially multiple simultaneous launches of - the same kernel over multiple devices. Grids inside a multi device kernel - launch need not be of uniform size, thus allowing taking into account - different device capabilities and preferences. - - .. deprecated:: 5.0 - -Grid - Same as the :ref:`inherent_thread_model` Grid entity. The ability to - synchronize over a grid requires the kernel to be launched using the - Cooperative Groups API. - -Block - Same as the :ref:`inherent_thread_model` Block entity. +those defined by the hardware. .. note:: - Explicit warp-level thread handling is absent from the Cooperative Groups API. - In order to exploit the known hardware SIMD width on which built-in - functionality translates to simpler logic, one may use the group partitioning - part of the API, such as ``tiled_partition``. + The set of implicit groups by kernel launch parameters are still available. + +For further information, check the :ref:`inherent thread model `. Memory Model ============ diff --git a/docs/understand/thread_hierarchy_coop_figure.rst b/docs/understand/thread_hierarchy_coop_figure.rst deleted file mode 100644 index 203f3b5335..0000000000 --- a/docs/understand/thread_hierarchy_coop_figure.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. figure:: ../data/understand/programming_model_reference/thread_hierarchy_coop.svg - :alt: Diagram depicting nested rectangles of varying color. The outermost one - titled "Grid", inside sets of different sized rectangles layered on - one another titled "Block". Each "Block" containing sets of uniform - rectangles layered on one another titled "Warp". Each of the "Warp" - titled rectangles filled with downward pointing arrows inside. - - Cooperative group thread hierarchy.