Minor improvements
neon60 committed Jul 30, 2024
1 parent c53f1cc commit 2376afd
Showing 6 changed files with 21 additions and 25 deletions.
3 changes: 1 addition & 2 deletions docs/conf.py
@@ -50,6 +50,5 @@

exclude_patterns = [
    "doxygen/mainpage.md",
-   "understand/glossary.md",
-   "understand/thread_hierarchy_coop_figure.rst"
+   "understand/glossary.md"
]
1 change: 0 additions & 1 deletion docs/doxygen/Doxyfile
@@ -834,7 +834,6 @@ INPUT = mainpage.md \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \
../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \
../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \
- ../../../llvm-project/clang/lib/Headers/__clang_hip_math.h \
../../../ROCR-Runtime/src/inc/hsa_ext_amd.h

# This tag can be used to specify the character encoding of the source files
24 changes: 12 additions & 12 deletions docs/how-to/cooperative_groups.rst
@@ -8,9 +8,7 @@
Cooperative Groups
*******************************************************************************

- Cooperative groups API is an extension to the HIP programming model, which provides developers with a flexible, dynamic grouping mechanism for the communicating threads. The API enables the developers to specify the level of granularity for thread communication which can lead to more efficient parallel decompositions.
-
- .. Maybe this sentence is better: The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than those defined by the hardware.
+ The cooperative groups API is an extension to the HIP programming model that provides developers with a flexible, dynamic grouping mechanism for communicating threads. Cooperative groups let you define your own sets of thread groups, which may fit your use cases better than those defined by the hardware. This lets you specify the level of granularity for thread communication, which can lead to more efficient parallel decompositions.
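For orientation, here is a minimal sketch of a kernel that uses this grouping mechanism (block-level only; the header and namespace are described below):

.. code-block:: cpp

    #include <hip/hip_cooperative_groups.h>

    namespace cg = cooperative_groups;

    __global__ void example_kernel()
    {
        // Handle for the calling thread's block, the most basic group
        cg::thread_block block = cg::this_thread_block();
        if (block.thread_rank() == 0)
        {
            // one thread per block could do leader-only work here
        }
        block.sync(); // synchronize all threads of the block
    }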

The API is accessible in the ``cooperative_groups`` namespace after ``hip_cooperative_groups.h`` is included. The header contains the following elements:

@@ -37,11 +35,13 @@ The thread hierarchy abstraction of cooperative groups is in

Cooperative group thread hierarchy.

+ The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (deprecated since 5.0). The **grid** in cooperative groups is a single dispatch of kernels for execution, like the original grid.

.. note::

    This feature introduced a new level between thread and thread block in the programming model.
+   The ability to synchronize over a grid or multi grid requires the kernel to be launched using the specific cooperative groups API.

- The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (Deprecated since 5.0). The **grid** in cooperative groups is a single dispatch of kernels for execution like the original grid. The ability to synchronize over a grid requires the kernel to be launched using the cooperative groups API. The **block** is the same as the :ref:`inherent_thread_model` block entity.
+ The **block** is the same as the :ref:`inherent_thread_model` block entity.
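Grid-wide synchronization can then be sketched as follows (valid only when the kernel is started with the cooperative launch API referenced in the note above):

.. code-block:: cpp

    namespace cg = cooperative_groups;

    __global__ void two_phase_kernel(float* data)
    {
        cg::grid_group grid = cg::this_grid();

        // Phase 1: every block produces its part of data
        grid.sync(); // no block continues until all blocks reach this point
        // Phase 2: safely consume values produced by other blocks
    }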

.. note::

@@ -92,7 +92,7 @@ are public of the ``grid_group`` class. For further details, check the :ref:`gri
Multi-grid group
------------------

- Represents an inter-device cooperative groups type where the participating threads within the group span multiple devices that run the same kernel on the devices. All the multi-grid group APIs require that you have used the appropriate launch API.
+ Represents an inter-device cooperative groups type where the participating threads within the group span multiple devices, running the same kernel on those devices. Use the cooperative launch API to enable synchronization across the multi-grid group.

.. code-block:: cpp
@@ -130,15 +130,15 @@ Constructed via:
.. note::

- * ``Size`` must be a power of 2 and not larger than warp (wavefront) size.
- * ``shfl`` functions support integer or float type.
+ * Size must be a power of 2 and not larger than warp (wavefront) size.
+ * ``shfl()`` functions support integer or float type.

- The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, ``shfl_up(...)`` and ``shfl_xor(...)`` member functions are public of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>` .
+ The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl()``, ``shfl_down()``, ``shfl_up()``, ``shfl_xor()``, ``ballot()``, ``any()``, ``all()``, ``match_any()`` and ``match_all()`` member functions are public members of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>`.
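As a minimal sketch, a tile-level reduction with ``tiled_partition()`` and ``shfl_down()`` might look like this (the tile size of 4 is an arbitrary power of 2):

.. code-block:: cpp

    namespace cg = cooperative_groups;

    __device__ float tile_sum(float val)
    {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<4> tile = cg::tiled_partition<4>(block);

        // Halve the stride each step; afterwards, tile rank 0 holds the tile's sum
        for (unsigned int offset = tile.size() / 2; offset > 0; offset /= 2)
        {
            val += tile.shfl_down(val, offset);
        }
        return val;
    }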

Coalesced groups
------------------

- Threads (64 threads on CDNA and 32 threads on RDNA or NVIDIA GPUs) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads referred as coalesced and coalesced groups represents an active thread group within a warp.
+ Threads (64 threads on CDNA and 32 threads on RDNA or NVIDIA GPUs) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and a coalesced group represents an active thread group within a warp.

This group type also supports sub-wave level intrinsics.

@@ -154,9 +154,9 @@ Constructed via:
.. note::

- * ``shfl`` functions support integer or float type.
+ ``shfl()`` functions support integer or float type.

- The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, and ``shfl_up(...)`` member functions are public of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>` .
+ The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl()``, ``shfl_down()``, ``shfl_up()``, ``ballot()``, ``any()``, ``all()``, ``match_any()`` and ``match_all()`` member functions are public members of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>`.
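A hedged sketch of obtaining a coalesced group inside a divergent branch:

.. code-block:: cpp

    namespace cg = cooperative_groups;

    __global__ void branch_kernel(const int* in, int* out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (in[idx] > 0)
        {
            // Only the threads that took this branch are members of the group
            cg::coalesced_group active = cg::coalesced_threads();
            // Broadcast the value held by the lowest-ranked active thread
            out[idx] = active.shfl(in[idx], 0) + active.thread_rank();
        }
    }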

Cooperative groups simple example
=================================
11 changes: 5 additions & 6 deletions docs/reference/cooperative_groups_reference.rst
@@ -1,6 +1,5 @@
.. meta::
-   :description: This chapter lists types and device API wrappers related to the Cooperative Group
-   feature. Programmers can directly use them in their kernels to make use of this feature.
+   :description: This chapter lists types and device API wrappers related to the Cooperative Group feature. Programmers can directly use these API features in their kernels.
:keywords: AMD, ROCm, HIP, cooperative groups

.. _cooperative_groups_reference:
@@ -12,7 +11,7 @@ HIP Cooperative Groups API
Cooperative kernel launches
===========================

- The following host side functions used for cooperative kernel launches.
+ The following host-side functions are used for cooperative kernel launches.

.. doxygenfunction:: hipLaunchCooperativeKernel(const void* f, dim3 gridDim, dim3 blockDimX, void** kernelParams, unsigned int sharedMemBytes, hipStream_t stream)
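A hedged host-side sketch of such a launch; ``my_kernel``, the sizes, and the omitted error handling are placeholders:

.. code-block:: cpp

    #include <hip/hip_runtime.h>

    __global__ void my_kernel(float* data); // hypothetical kernel that calls grid.sync()

    void launch(float* d_data, int num_blocks, int threads_per_block)
    {
        void* args[] = {&d_data};
        // Every block must be resident on the device at once,
        // otherwise grid-wide synchronization cannot make progress.
        hipLaunchCooperativeKernel((const void*)my_kernel,
                                   dim3(num_blocks), dim3(threads_per_block),
                                   args, 0 /*sharedMemBytes*/, nullptr /*stream*/);
    }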

@@ -48,7 +47,7 @@ The following cooperative groups classes can be used on the device side.

.. doxygenclass:: cooperative_groups::multi_grid_group
:members:

.. _thread_block_tile_ref:

.. doxygenclass:: cooperative_groups::thread_block_tile
@@ -62,7 +61,7 @@ The following cooperative groups classes can be used on the device side.
Cooperative groups construct functions
======================================

- The following functions are used to construct different group types instances on the device side.
+ The following functions are used to construct different group-type instances on the device side.

.. doxygenfunction:: cooperative_groups::this_multi_grid

@@ -83,7 +82,7 @@ The following functions are used to construct different group types instances on
Cooperative groups exposed API functions
========================================

- The following functions are the exposed API for different group types instances on the device side.
+ The following functions are the exposed API for different group-type instances on the device side.

.. doxygenfunction:: cooperative_groups::group_size

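As a hedged sketch, these exposed functions mirror the member functions of the group types:

.. code-block:: cpp

    namespace cg = cooperative_groups;

    __global__ void free_function_kernel(unsigned int* out)
    {
        cg::thread_block block = cg::this_thread_block();
        cg::sync(block);                 // same as block.sync()
        if (cg::thread_rank(block) == 0) // same as block.thread_rank()
        {
            out[blockIdx.x] = cg::group_size(block); // same as block.size()
        }
    }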
1 change: 0 additions & 1 deletion docs/sphinx/_toc.yml.in
@@ -45,7 +45,6 @@ subtrees:
  - file: reference/cooperative_groups_reference
    title: HIP Cooperative Groups API
  - file: reference/virtual_rocr
-   title: HSA Runtime API for ROCm
  - file: reference/unified_memory_reference
    title: HIP Managed Memory Allocation API
  - file: reference/deprecated_api_list
6 changes: 3 additions & 3 deletions docs/tutorial/cooperative_groups_tutorial.rst
@@ -46,10 +46,10 @@ You can use tiled partition to calculate the sum of ``partition_size`` length se
return result;
}
- Device side code
+ Device-side code
----------------

- To calculate the sum of the sets of numbers, the tutorial uses the shared memory-based reduction on the device side. The warp level intrinsics usage is not covered in this tutorial, unlike in the :doc:`reduction tutorial. <reduction>` The x input variable is a shared pointer, which needs to be synchronized after every value changes. The ``thread_group`` input parameter can be ``thread_block_tile`` or ``thread_block`` because the ``thread_group`` is the parent class of these types. The ``val`` are the numbers to calculate the sum of. The returned results of this function return the final results of the reduction on thread ID 0 of the ``thread_group``, and for every other thread, the function results are 0.
+ To calculate the sum of the sets of numbers, the tutorial uses shared memory-based reduction on the device side. Warp-level intrinsics are not covered in this tutorial, unlike in the :doc:`reduction tutorial <reduction>`. The ``x`` input variable is a shared pointer, which needs to be synchronized after every value change. The ``thread_group`` input parameter can be ``thread_block_tile`` or ``thread_block`` because ``thread_group`` is the parent class of these types. ``val`` holds the numbers to sum. The function returns the final result of the reduction on thread ID 0 of the ``thread_group``; for every other thread, the function result is 0.

.. code-block:: cuda
@@ -139,7 +139,7 @@ In this code section, the sum is calculated on ``thread_block_group`` level, the
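A minimal sketch of such a shared memory reduction over a generic ``thread_group`` (assumed names; the tutorial's own listing may differ):

.. code-block:: cuda

    __device__ int reduce_sum(cooperative_groups::thread_group g, int* x, int val)
    {
        const unsigned int lane = g.thread_rank();
        // Halve the active range each step, staging partial sums in shared memory
        for (unsigned int i = g.size() / 2; i > 0; i /= 2)
        {
            x[lane] = val;
            g.sync(); // wait until every partial sum is visible
            if (lane < i)
            {
                val += x[lane + i];
            }
            g.sync(); // wait before x is overwritten in the next step
        }
        return val; // only thread_rank() == 0 holds the complete sum
    }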
3. The reduction of custom partition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- In this code section, the sum is calculated on the custom partition level, then the results are stored in global memory. The custom partition is a partial block of the thread block, it means the reduction calculate on a shorter sequence of input numbers than at the ``thread_block_group`` case.
+ In this code section, the sum is calculated at the custom partition level, then the results are stored in global memory. The custom partition is a partial block of the thread block, which means the reduction is calculated over a shorter sequence of input numbers than in the ``thread_block_group`` case.

.. code-block:: cuda
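A hedged sketch of creating such a custom partition and feeding it to a reduction like ``reduce_sum()`` above (assuming, per the tutorial text, that a ``thread_block_tile`` can be passed where a ``thread_group`` is expected; all names are placeholders):

.. code-block:: cuda

    namespace cg = cooperative_groups;

    __global__ void partition_reduce(const int* input, int* output)
    {
        __shared__ int scratch[256]; // assumes a block size of 256 threads
        cg::thread_block block = cg::this_thread_block();
        // Split the block into 16-thread partitions, each reducing its own slice
        cg::thread_block_tile<16> part = cg::tiled_partition<16>(block);

        int* slice = scratch + (block.thread_rank() / 16) * 16;
        int val    = input[blockIdx.x * blockDim.x + block.thread_rank()];
        int sum    = reduce_sum(part, slice, val); // reduce_sum() from the sketch above

        if (part.thread_rank() == 0)
        {
            output[blockIdx.x * (blockDim.x / 16) + block.thread_rank() / 16] = sum;
        }
    }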
