
Commit

PR feedbacks
WIP

WIP

WIP

WIP
neon60 committed Jul 24, 2024
1 parent d71d3b7 commit 6a92313
Showing 9 changed files with 101 additions and 101 deletions.


51 changes: 37 additions & 14 deletions docs/how-to/cooperative_groups.rst
@@ -2,11 +2,15 @@
:description: This topic describes how to use cooperative groups in HIP
:keywords: AMD, ROCm, HIP, cooperative groups

.. _cooperative_groups_how-to:

*******************************************************************************
Cooperative Groups
*******************************************************************************

- Cooperative groups API is an extension to the ROCm programming model. It provides developers with a flexible grouping mechanism. This feature was introduced with ROCm 4.1 at AMD platform and CUDA 9.0 at NVIDIA platform.
+ Cooperative groups API is an extension to the HIP programming model, which provides developers with a flexible, dynamic grouping mechanism for communicating threads. The API lets developers specify the granularity of thread communication, which can lead to more efficient parallel decompositions.

.. Maybe this sentence is better: The rich set of APIs introduced by Cooperative Groups allow the programmer to define their own set of thread groups which may fit their user-cases better than those defined by the hardware.
The API is accessible in the ``cooperative_groups`` namespace after the ``hip_cooperative_groups.h`` header is included. The header contains the following elements (a short usage sketch follows the list):

@@ -16,19 +20,37 @@ The API is accessible in the ``cooperative_groups`` namespace after the ``hip_c
* Hardware-accelerated operations over the whole group, like shuffles.
* Static functions to create groups and subgroups.
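
For example, a minimal sketch of using the namespace in a kernel (the kernel name and the ``cg`` alias are illustrative, not mandated by the API):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __global__ void example_kernel()
   {
       // Handle to the group of all threads of the current block.
       cg::thread_block block = cg::this_thread_block();

       // Synchronize the block through the group, equivalent to __syncthreads().
       block.sync();
   }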

.. note::

   This feature was introduced with ROCm 4.1 on AMD platforms and CUDA 9.0 on NVIDIA platforms. It introduced additional levels of synchronization beyond the thread block.

Cooperative groups thread model
===============================

The thread hierarchy abstraction of cooperative groups is in
- :numref:`coop_thread_hierarchy_how_to`.
+ :numref:`coop_thread_hierarchy`.

.. _coop_thread_hierarchy:

.. _coop_thread_hierarchy_how_to:
.. figure:: ../data/how-to/cooperative_groups/thread_hierarchy_coop.svg
:alt: Diagram depicting nested rectangles of varying color. The outermost one
titled "Grid", inside sets of different sized rectangles layered on
one another titled "Block". Each "Block" containing sets of uniform
rectangles layered on one another titled "Warp". Each of the "Warp"
titled rectangles filled with downward pointing arrows inside.

- .. include:: ../understand/thread_hierarchy_coop_figure.rst
Cooperative group thread hierarchy.

The **multi grid** is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (deprecated since ROCm 5.0). The **grid** in cooperative groups is a single dispatch of a kernel for execution, like the original grid. The ability to synchronize over a grid requires the kernel to be launched through the cooperative groups API. The **block** is the same as the :ref:`inherent_thread_model` block entity.

- For further information, check the :ref:`inherent thread model <inherent_thread_model>`. For details on memory model, check the :ref:`memory model description <memory_hierarchy>`
.. note::

   Explicit warp-level thread handling is absent from the Cooperative Groups API. To exploit the known hardware SIMD width, on which built-in functionality translates to simpler logic, use the group partitioning part of the API, such as ``tiled_partition`` (see the sketch below).
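
As a sketch of such partitioning (the tile size of 32 and the kernel name are illustrative assumptions):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __global__ void partition_kernel()
   {
       cg::thread_block block = cg::this_thread_block();

       // Split the block into tiles of 32 threads. The size must be a
       // compile-time constant, a power of two, and not larger than the warp size.
       cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

       unsigned int lane    = tile.thread_rank();     // rank within the tile
       unsigned int tile_id = tile.meta_group_rank(); // rank of the tile within the block

       // Synchronize only the threads of this tile.
       tile.sync();
   }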

For details on the memory model, check the :ref:`memory model description <memory_hierarchy>`.

Cooperative groups simple example
=================================
@@ -98,7 +120,7 @@ The difference to the original block model in the ``reduce_sum`` device function
// ...
}
- The ``reduce_sum`` function call and input data initialization difference to the original block model is the following.
+ The ``reduce_sum()`` function call and the input data initialization differ from the original block model as follows.

.. tab-set::
.. tab-item:: Original Block
@@ -144,7 +166,6 @@ The ``reduce_sum`` function call and input data initialization difference to the
}
At the device function, the input group type is ``thread_group``, which is the parent class of all cooperative group types. With this, you can write generic functions that work with any type of cooperative group.
The kernel launch also differs with cooperative groups, as it depends on the group type: for grid groups on a single GPU, for example, ``hipLaunchCooperativeKernel`` has to be used, as the sketch below shows.
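
A sketch of both points follows; ``reduce_sum`` mirrors the example above, while ``example_kernel`` and the launch helper are illustrative:

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   // Generic device function: g can be a block, a tile, or any other group type.
   __device__ int reduce_sum(cg::thread_group g, int* temp, int value)
   {
       int lane = g.thread_rank();
       for (int i = g.size() / 2; i > 0; i /= 2)
       {
           temp[lane] = value;
           g.sync(); // wait for all threads of the group to store
           if (lane < i)
               value += temp[lane + i];
           g.sync(); // wait for all threads of the group to load
       }
       return value; // thread 0 of the group holds the full sum
   }

   __global__ void example_kernel(int* data) { /* ... calls reduce_sum() ... */ }

   // Host side: kernels that use a grid group must be launched with the
   // cooperative launch API instead of the <<<...>>> syntax.
   void launch(int* d_data, dim3 grid, dim3 block, hipStream_t stream)
   {
       void* args[] = {&d_data};
       hipLaunchCooperativeKernel((void*)example_kernel, grid, block, args,
                                  0 /*sharedMemBytes*/, stream);
   }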

Group types
===========
@@ -154,7 +175,7 @@ Group types are based on the levels of synchronization and data sharing among th
Thread-block group
------------------

- Represents an intra-workgroup cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing ``workgroup``.
+ Represents an intra-block cooperative groups type where the participating threads within the group are the same threads that participate in the currently executing ``block``.

.. code-block:: cpp
@@ -171,7 +192,7 @@ The ``group_index()`` , ``thread_index()`` , ``thread_rank()`` , ``size()``, ``c
Grid group
------------

- Represents an inter-workgroup cooperative groups type where the group's participating threads span multiple workgroups running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid.
+ Represents an inter-block cooperative groups type where the group's participating threads span multiple blocks running the same kernel on the same device. Use the cooperative launch API to synchronize across the grid.

.. code-block:: cpp
@@ -227,16 +248,17 @@ Constructed via:
.. note::

- * ``Size`` must be a power of 2 and not larger than wavefront size.
+ * ``Size`` must be a power of 2 and not larger than warp (wavefront) size.
* ``shfl`` functions support integer or float type.

The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, ``shfl_up(...)``, and ``shfl_xor(...)`` member functions are public members of the ``thread_block_tile`` class. For further details, check the :ref:`thread_block_tile references <thread_block_tile_ref>`.
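
For instance, a tile-wide sum can be written with ``shfl_down()`` and no shared memory (a minimal sketch; the tile size of 32 and the function name are assumptions):

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __device__ int tile_sum(cg::thread_block_tile<32> tile, int value)
   {
       // Each step halves the number of contributing lanes; after the loop,
       // the thread with tile.thread_rank() == 0 holds the sum of the tile.
       for (int offset = tile.size() / 2; offset > 0; offset /= 2)
           value += tile.shfl_down(value, offset);
       return value;
   }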

Coalesced groups
------------------

- Represents an active thread group in a wavefront. This group type also supports sub-wave level
- intrinsics.
+ Threads in a warp (64 threads on CDNA GPUs and 32 threads on RDNA or NVIDIA GPUs) cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, leaving some threads disabled when they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and a coalesced group represents an active thread group within a warp.

This group type also supports sub-wave level intrinsics.

.. code-block:: cpp
@@ -254,11 +276,12 @@ Constructed via:

The ``thread_rank()``, ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ``meta_group_rank()``, ``meta_group_size()``, ``shfl(...)``, ``shfl_down(...)``, and ``shfl_up(...)`` member functions are public members of the ``coalesced_group`` class. For more information, see :ref:`coalesced_group references <coalesced_group_ref>`.
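
A common use is aggregating atomic operations over the currently active threads; the sketch below (the function name is illustrative) is meant to be called from inside a divergent branch:

.. code-block:: cpp

   #include <hip/hip_cooperative_groups.h>

   namespace cg = cooperative_groups;

   __device__ int aggregated_atomic_inc(int* counter)
   {
       // Only the threads active at this point of the branch join the group.
       cg::coalesced_group active = cg::coalesced_threads();

       int prev = 0;
       // One thread performs a single atomic operation for the whole group ...
       if (active.thread_rank() == 0)
           prev = atomicAdd(counter, (int)active.size());

       // ... then broadcasts the old value to the other active threads.
       prev = active.shfl(prev, 0) + active.thread_rank();
       return prev;
   }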

.. _coop_synchronization:

Synchronization
===============

With each group type, the synchronization requires using the correct cooperative groups launch API.

**Check the kernel launch capability**
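
A sketch of such a query through the device attribute API (the helper name is illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   bool cooperative_launch_supported()
   {
       int device = 0;
       int supports_coop_launch = 0;
       hipGetDevice(&device);
       hipDeviceGetAttribute(&supports_coop_launch,
                             hipDeviceAttributeCooperativeLaunch, device);
       return supports_coop_launch != 0;
   }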

4 changes: 3 additions & 1 deletion docs/reference/cooperative_groups_reference.rst
@@ -3,8 +3,10 @@
feature. Programmers can directly use them in their kernels to make use of this feature.
:keywords: AMD, ROCm, HIP, cooperative groups

.. _cooperative_groups_reference:

*******************************************************************************
- Cooperative Groups API
+ HIP Cooperative Groups API
*******************************************************************************

Cooperative kernel launches
12 changes: 10 additions & 2 deletions docs/reference/cpp_language_extensions.rst
@@ -292,8 +292,7 @@ dimensions to 1.
Memory fence instructions
====================================================

HIP supports ``__threadfence()`` and ``__threadfence_block()``. If you're using ``threadfence_system()`` in the HIP-Clang path, you can use the following workaround:

#. Build HIP with the ``HIP_COHERENT_HOST_ALLOC`` environment variable enabled.
#. Modify kernels that use ``__threadfence_system()`` as follows:
@@ -306,9 +305,16 @@ the HIP-Clang path, you can use the following workaround:

Synchronization functions
====================================================

There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards when threads in a group access the same addresses in shared or global memory. These data hazards can be avoided by synchronizing the group.

Synchronization functions cause all threads in the group to wait at the synchronization point until all shared and global memory accesses by the threads of the group are complete. This guarantees the visibility of the accessed data for all threads in the group.

The ``__syncthreads()`` built-in function is supported in HIP. The ``__syncthreads_count(int)``,
``__syncthreads_and(int)``, and ``__syncthreads_or(int)`` functions are under development.
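
For example, a read-after-write hazard on shared memory is avoided by placing ``__syncthreads()`` between the write and the read (a minimal sketch assuming a single block of 64 threads):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void reverse(int* data)
   {
       __shared__ int tile[64]; // assumes blockDim.x == 64
       int tid = threadIdx.x;

       tile[tid] = data[tid];
       __syncthreads(); // all writes to tile[] complete before any thread reads
       data[tid] = tile[blockDim.x - 1 - tid];
   }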

The Cooperative Groups API offers options to perform synchronization on a developer-defined set of thread groups. For further information, check :ref:`Cooperative Groups API <cooperative_groups_reference>` or :ref:`Cooperative Groups how to <cooperative_groups_how-to>`.

Math functions
====================================================

@@ -2080,6 +2086,8 @@ HIP supports the following kernel language cooperative groups types and function
- ✓
- ✓

For further information, check :ref:`Cooperative Groups API <cooperative_groups_reference>` or :ref:`Cooperative Groups how to <cooperative_groups_how-to>`.

Warp matrix functions
============================================================

2 changes: 1 addition & 1 deletion docs/sphinx/_toc.yml.in
@@ -42,7 +42,7 @@ subtrees:
- file: reference/terms
title: Comparing Syntax for different APIs
- file: reference/cooperative_groups_reference
-   title: Cooperative Groups API
+   title: HIP Cooperative Groups API
- file: reference/virtual_rocr
title: HSA Runtime API for ROCm
- file: reference/unified_memory_reference
33 changes: 4 additions & 29 deletions docs/understand/programming_model_reference.rst
@@ -89,38 +89,13 @@ model.

The rich set of APIs introduced by Cooperative Groups allow the programmer to
define their own set of thread groups which may fit their user-cases better than
- those defined by the hardware. The set of implicit groups by kernel launch
- parameters are still available.

- The thread hierarchy abstraction of Cooperative Groups manifest as depicted in
- :numref:`coop_thread_hierarchy`.

- .. _coop_thread_hierarchy:

- .. include:: thread_hierarchy_coop_figure.rst

- Multi Grid
-   An abstraction of potentially multiple simultaneous launches of
-   the same kernel over multiple devices. Grids inside a multi device kernel
-   launch need not be of uniform size, thus allowing taking into account
-   different device capabilities and preferences.

-   .. deprecated:: 5.0

- Grid
-   Same as the :ref:`inherent_thread_model` Grid entity. The ability to
-   synchronize over a grid requires the kernel to be launched using the
-   Cooperative Groups API.

- Block
-   Same as the :ref:`inherent_thread_model` Block entity.
+ those defined by the hardware.

- .. note::
-
-   Explicit warp-level thread handling is absent from the Cooperative Groups API.
-   In order to exploit the known hardware SIMD width on which built-in
-   functionality translates to simpler logic, one may use the group partitioning
-   part of the API, such as ``tiled_partition``.
+ The set of implicit groups by kernel launch parameters are still available.

+ For further information, check the :ref:`inherent thread model <coop_thread_hierarchy>`.

Memory Model
============
8 changes: 0 additions & 8 deletions docs/understand/thread_hierarchy_coop_figure.rst

This file was deleted.
