
Commit

PR feedbacks
WIP

WIP

WIP

WIP
neon60 committed Jul 24, 2024
1 parent d71d3b7 commit 6a92313
Showing 9 changed files with 101 additions and 101 deletions.


51 changes: 37 additions & 14 deletions docs/how-to/cooperative_groups.rst
@@ -2,11 +2,15 @@
:description: This topic describes how to use cooperative groups in HIP
:keywords: AMD, ROCm, HIP, cooperative groups

.. _cooperative_groups_how-to:

*******************************************************************************
Cooperative Groups
*******************************************************************************

- Cooperative groups API is an extension to the ROCm programming model. It provides developers with a flexible grouping mechanism. This feature was introduced with ROCm 4.1 at AMD platform and CUDA 9.0 at NVIDIA platform.
+ Cooperative groups API is an extension to the HIP programming model, which provides developers with a flexible, dynamic grouping mechanism for communicating threads. The API lets developers specify the granularity of thread communication, which can lead to more efficient parallel decompositions.

.. Maybe this sentence is better: The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than those defined by the hardware.
The API is accessible in the ``cooperative_groups`` namespace after the ``hip_cooperative_groups.h`` header is included. The header contains the following elements (a short usage sketch follows the list):

@@ -16,19 +20,37 @@ The API is accessible in the ``cooperative_groups`` namespace after the ``hip_c
* Hardware-accelerated operations over the whole group, like shuffles.
* Static functions to create groups and subgroups.
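
For example, a minimal sketch of using the namespace in a kernel (the kernel name and the ``cg`` alias are illustrative, not mandated by the API):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __global__ void example_kernel()
   {
       // Handle to the group of all threads of the current block.
       cg::thread_block block = cg::this_thread_block();

       // Synchronize the block through the group, equivalent to __syncthreads().
       block.sync();
   }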

.. note::

   This feature was introduced with ROCm 4.1 on AMD platforms and CUDA 9.0 on NVIDIA platforms. It introduced additional levels of synchronization beyond the thread block.

Cooperative groups thread model
===============================

The thread hierarchy abstraction of cooperative groups is in
- :numref:`coop_thread_hierarchy_how_to`.
+ :numref:`coop_thread_hierarchy`.

.. _coop_thread_hierarchy:

.. _coop_thread_hierarchy_how_to:
.. figure:: ../data/how-to/cooperative_groups/thread_hierarchy_coop.svg
:alt: Diagram depicting nested rectangles of varying color. The outermost one
titled "Grid", inside sets of different sized rectangles layered on
one another titled "Block". Each "Block" containing sets of uniform
rectangles layered on one another titled "Warp". Each of the "Warp"
titled rectangles filled with downward pointing arrows inside.

- .. include:: ../understand/thread_hierarchy_coop_figure.rst
Cooperative group thread hierarchy.

The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (deprecated since ROCm 5.0). The **grid** in cooperative groups is a single dispatch of a kernel for execution, like the original grid. The ability to synchronize over a grid requires the kernel to be launched through the cooperative groups API. The **block** is the same as the :ref:`inherent_thread_model` block entity.

- For further information, check the :ref:`inherent thread model <inherent_thread_model>`. For details on memory model, check the :ref:`memory model description <memory_hierarchy>`
.. note::

   Explicit warp-level thread handling is absent from the Cooperative Groups API. To exploit the known hardware SIMD width, on which built-in functionality translates to simpler logic, use the group partitioning part of the API, such as ``tiled_partition`` (see the sketch below).
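
As a sketch of such partitioning (the tile size of 32 and the kernel name are illustrative assumptions):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __global__ void partition_kernel()
   {
       cg::thread_block block = cg::this_thread_block();

       // Split the block into tiles of 32 threads. The size must be a
       // compile-time constant, a power of two, and not larger than the warp size.
       cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

       unsigned int lane    = tile.thread_rank();     // rank within the tile
       unsigned int tile_id = tile.meta_group_rank(); // rank of the tile within the block

       // Synchronize only the threads of this tile.
       tile.sync();
   }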

For details on the memory model, check the :ref:`memory model description <memory_hierarchy>`.

Cooperative groups simple example
=================================
@@ -98,7 +120,7 @@ The difference to the original block model in the ``reduce_sum`` device function
// ...
}
- The ``reduce_sum`` function call and input data initialization difference to the original block model is the following.
+ The ``reduce_sum()`` function call and the input data initialization differ from the original block model as follows.

.. tab-set::
.. tab-item:: Original Block
@@ -144,7 +166,6 @@ The ``reduce_sum`` function call and input data initialization difference to the
}
At the device function, the input group type is ``thread_group``, which is the parent class of all cooperative group types. With this, you can write generic functions that work with any type of cooperative group.
The kernel launch also differs with cooperative groups, as it depends on the group type: for grid groups on a single GPU, for example, ``hipLaunchCooperativeKernel`` has to be used, as the sketch below shows.
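
A sketch of both points follows; ``reduce_sum`` mirrors the example above, while ``example_kernel`` and the launch helper are illustrative:

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   // Generic device function: g can be a block, a tile, or any other group type.
   __device__ int reduce_sum(cg::thread_group g, int* temp, int value)
   {
       int lane = g.thread_rank();
       for (int i = g.size() / 2; i > 0; i /= 2)
       {
           temp[lane] = value;
           g.sync(); // wait for all threads of the group to store
           if (lane < i)
               value += temp[lane + i];
           g.sync(); // wait for all threads of the group to load
       }
       return value; // thread 0 of the group holds the full sum
   }

   __global__ void example_kernel(int* data) { /* ... calls reduce_sum() ... */ }

   // Host side: kernels that use a grid group must be launched with the
   // cooperative launch API instead of the <<<...>>> syntax.
   void launch(int* d_data, dim3 grid, dim3 block, hipStream_t stream)
   {
       void* args[] = {&d_data};
       hipLaunchCooperativeKernel((void*)example_kernel, grid, block, args,
                                  0 /*sharedMemBytes*/, stream);
   }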

Group types
===========
@@ -154,7 +175,7 @@ Group types are based on the levels of synchronization and data sharing among th
Thread-block group
------------------

- Represents an intra-workgroup cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing ``workgroup``.
+ Represents an intra-block cooperative groups type where the participating threads within the group are the same threads that participate in the currently executing ``block``.

.. code-block:: cpp
@@ -171,7 +192,7 @@ The ``group_index()`` , ``thread_index()`` , ``thread_rank()`` , ``size()``, ``c
Grid group
------------

- Represents an inter-workgroup cooperative groups type where the group's participating threads span multiple workgroups running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid.
+ Represents an inter-block cooperative groups type where the group's participating threads span multiple blocks running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid.

.. code-block:: cpp
@@ -227,16 +248,17 @@ Constructed via:
.. note::

- * ``Size`` must be a power of 2 and not larger than wavefront size.
+ * ``Size`` must be a power of 2 and not larger than warp (wavefront) size.
* ``shfl`` functions support integer or float type.

The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, ``shfl_up(...)``, and ``shfl_xor(...)`` member functions are public members of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>`.
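
For instance, a tile-wide sum can be written with ``shfl_down()`` and no shared memory (a minimal sketch; the tile size of 32 and the function name are assumptions):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __device__ int tile_sum(cg::thread_block_tile<32> tile, int value)
   {
       // Each step halves the number of contributing lanes; after the loop,
       // the thread with tile.thread_rank() == 0 holds the sum of the tile.
       for (int offset = tile.size() / 2; offset > 0; offset /= 2)
           value += tile.shfl_down(value, offset);
       return value;
   }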

Coalesced groups
------------------

- Represents an active thread group in a wavefront. This group type also supports sub-wave level
- intrinsics.
+ Threads in a warp (64 threads on CDNA GPUs and 32 threads on RDNA or NVIDIA GPUs) cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, leaving some threads disabled when they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and a coalesced group represents an active thread group within a warp.

This group type also supports sub-wave level intrinsics.

.. code-block:: cpp
@@ -254,11 +276,12 @@ Constructed via:

The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, and ``shfl_up(...)`` member functions are public members of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>`.
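
A common use is aggregating atomic operations over the currently active threads; the sketch below (the function name is illustrative) is meant to be called from inside a divergent branch:

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __device__ int aggregated_atomic_inc(int* counter)
   {
       // Only the threads active at this point of the branch join the group.
       cg::coalesced_group active = cg::coalesced_threads();

       int prev = 0;
       // One thread performs a single atomic operation for the whole group ...
       if (active.thread_rank() == 0)
           prev = atomicAdd(counter, (int)active.size());

       // ... then broadcasts the old value to the other active threads.
       prev = active.shfl(prev, 0) + active.thread_rank();
       return prev;
   }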

.. _coop_synchronization:

Synchronization
===============

With each group type, the synchronization requires using the correct cooperative groups launch API.

**Check the kernel launch capability**
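
A sketch of such a query through the device attribute API (the helper name is illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   bool cooperative_launch_supported()
   {
       int device = 0;
       int supports_coop_launch = 0;
       hipGetDevice(&device);
       hipDeviceGetAttribute(&supports_coop_launch,
                             hipDeviceAttributeCooperativeLaunch, device);
       return supports_coop_launch != 0;
   }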

4 changes: 3 additions & 1 deletion docs/reference/cooperative_groups_reference.rst
@@ -3,8 +3,10 @@
feature. Programmers can directly use them in their kernels to make use of this feature.
:keywords: AMD, ROCm, HIP, cooperative groups

.. _cooperative_groups_reference:

*******************************************************************************
- Cooperative Groups API
+ HIP Cooperative Groups API
*******************************************************************************

Cooperative kernel launches
12 changes: 10 additions & 2 deletions docs/reference/cpp_language_extensions.rst
@@ -292,8 +292,7 @@ dimensions to 1.
Memory fence instructions
====================================================

HIP supports ``__threadfence()`` and ``__threadfence_block()``. If you're using ``threadfence_system()`` in the HIP-Clang path, you can use the following workaround:

#. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled.
#. Modify kernels that use ``__threadfence_system()`` as follows:
@@ -306,9 +305,16 @@ the HIP-Clang path, you can use the following workaround:

Synchronization functions
====================================================

There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards when threads in a group access the same addresses in shared or global memory. These data hazards can be avoided by synchronizing the group.

Synchronization functions cause all threads in the group to wait at the synchronization point until all shared and global memory accesses by the threads of the group are complete. This guarantees the visibility of the accessed data for all threads in the group.

The ``__syncthreads()`` built-in function is supported in HIP. The ``__syncthreads_count(int)``,
``__syncthreads_and(int)``, and ``__syncthreads_or(int)`` functions are under development.
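
For example, a read-after-write hazard on shared memory is avoided by placing ``__syncthreads()`` between the write and the read (a minimal sketch assuming a single block of 64 threads):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void reverse(int* data)
   {
       __shared__ int tile[64]; // assumes blockDim.x == 64
       int tid = threadIdx.x;

       tile[tid] = data[tid];
       __syncthreads(); // all writes to tile[] complete before any thread reads
       data[tid] = tile[blockDim.x - 1 - tid];
   }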

The Cooperative Groups API offers options to perform synchronization on a developer-defined set of thread groups. For further information, check :ref:`Cooperative Groups API <cooperative_groups_reference>` or :ref:`Cooperative Groups how to <cooperative_groups_how-to>`.

Math functions
====================================================

@@ -2080,6 +2086,8 @@ HIP supports the following kernel language cooperative groups types and function
- ✓
- ✓

For further information, check :ref:`Cooperative Groups API <cooperative_groups_reference>` or :ref:`Cooperative Groups how to <cooperative_groups_how-to>`.

Warp matrix functions
============================================================

2 changes: 1 addition & 1 deletion docs/sphinx/_toc.yml.in
@@ -42,7 +42,7 @@ subtrees:
- file: reference/terms
title: Comparing Syntax for different APIs
- file: reference/cooperative_groups_reference
-   title: Cooperative Groups API
+   title: HIP Cooperative Groups API
- file: reference/virtual_rocr
title: HSA Runtime API for ROCm
- file: reference/unified_memory_reference
33 changes: 4 additions & 29 deletions docs/understand/programming_model_reference.rst
@@ -89,38 +89,13 @@ model.

The rich set of APIs introduced by Cooperative Groups allow the programmer to
define their own set of thread groups which may fit their user-cases better than
- those defined by the hardware. The set of implicit groups by kernel launch
- parameters are still available.

- The thread hierarchy abstraction of Cooperative Groups manifest as depicted in
- :numref:`coop_thread_hierarchy`.

- .. _coop_thread_hierarchy:

- .. include:: thread_hierarchy_coop_figure.rst

- Multi Grid
-   An abstraction of potentially multiple simultaneous launches of
-   the same kernel over multiple devices. Grids inside a multi device kernel
-   launch need not be of uniform size, thus allowing taking into account
-   different device capabilities and preferences.

-   .. deprecated:: 5.0

- Grid
-   Same as the :ref:`inherent_thread_model` Grid entity. The ability to
-   synchronize over a grid requires the kernel to be launched using the
-   Cooperative Groups API.

- Block
-   Same as the :ref:`inherent_thread_model` Block entity.
+ those defined by the hardware.

- .. note::
-
-   Explicit warp-level thread handling is absent from the Cooperative Groups API.
-   In order to exploit the known hardware SIMD width on which built-in
-   functionality translates to simpler logic, one may use the group partitioning
-   part of the API, such as ``tiled_partition``.
+ The set of implicit groups by kernel launch parameters are still available.

+ For further information, check the :ref:`inherent thread model <coop_thread_hierarchy>`.

Memory Model
============
8 changes: 0 additions & 8 deletions docs/understand/thread_hierarchy_coop_figure.rst

This file was deleted.
