Doc: Rephrase pinning.
kouchy committed Mar 28, 2024
1 parent a1b0cee commit 6d62ec5
Showing 2 changed files with 61 additions and 50 deletions.
3 changes: 3 additions & 0 deletions docs/abbreviations.md
@@ -36,6 +36,7 @@
*[ISA]: Instruction Set Architecture
*[ISAs]: Instruction Set Architectures
*[JSON]: JavaScript Object Notation
*[LLC]: Last Level Cache
*[LUT]: Look Up Table
*[LUTs]: Look Up Tables
*[MKL]: Intel Math Kernel Library
@@ -45,6 +46,7 @@
*[MSVC]: Microsoft Visual C++
*[MT 19937]: Mersenne Twister 19937
*[NEON]: ARM SIMD instructions
*[NUMA]: Non Uniform Memory Access
*[OS]: Operating System
*[OSs]: Operating Systems
*[PRNG]: Pseudo Random Number Generator
@@ -68,3 +70,4 @@
*[SSE4.1]: Streaming SIMD Extensions 4.1
*[SSE4.2]: Streaming SIMD Extensions 4.2
*[STD]: Standard
*[UMA]: Uniform Memory Access
108 changes: 58 additions & 50 deletions docs/thread_pinning.md
@@ -1,6 +1,6 @@
# Thread Pinning

`AFF3CT-core` enables to select on which CPU process units (PUs) the threads are
`AFF3CT-core` enables to select on which process units (PUs) the threads are
effectively run. This is called *thread pinning* and it can significantly
benefit to the performance, especially on modern heterogeneous architectures.
To do so, the runtime relies on the
@@ -25,7 +25,7 @@ To do so, the runtime relies on the

*Portable Hardware Locality* (`hwloc` in short) is a library which provides a
**portable abstraction** of the **hierarchical topology of modern
architectures** (see the illustration below).
architectures** (see the figure below).

<figure markdown>
![Orange Pi 5](./assets/hwloc_orangepi5.svg)
@@ -36,52 +36,53 @@ architectures** (see the illustration below).
</figcaption>
</figure>

`hwloc` gives the ability to pin threads over any level of hierarchy with a tree
view, where the process units are the leaves and there are intern nodes which
represent a set of PUs that are physically close (share the same LLC or are in
the same NUMA node).
`hwloc` gives the ability to pin threads over various levels of hierarchy
represented by a tree structure. The deepest/lowest nodes (the leaves) are the
PUs while higher nodes represent sets of PUs that are physically close. For
instance, a PUs set can share the same UMA node (in the case of a NUMA
architecture), the same LLC or the same package.

For instance, we can choose to pin a thread over a *package* and it will be able
to execute on all the PUs that are in this level. In the Orange Pi 5 SBC, if we
choose `Package L#0` the thread will run over the following set of PUs:
`PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`. Consequently, **the pinned thread can
move in the selected `hwloc` object during the execution** and it is up to the
OS to schedule the thread on the available set of PUs.
In the Orange Pi 5 SBC, if we pin a thread on the `Package L#0`, it will run
over the following set of PUs: `PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`.
Thus, **the pinned thread can move in the selected `hwloc` node during the
execution** and it is up to the OS to schedule the thread on the selected PUs
set.

!!! warning
The indexes given by `hwloc` are different from those given by the OS: they
are logical indexes that express the real locality. **Consequently, in
The indexes given by `hwloc` can be different from those given by the OS:
they are logical indexes that express the real locality. **Consequently, in
`AFF3CT-core`, it is important to use `hwloc` logical indexes.** The
`hwloc-ls` command gives an overview of the current topology with these
logical indexes.

## Sequence & Pipeline

In `AFF3CT-core`, the thread pinning can be set in `runtime::Sequence` and
`runtime::Pipeline` classes constructor. In both cases, there is a dedicated
argument of `std::string` type: `sequence_pinning_policy` for
`runtime::Sequence` and `pipeline_pinning_policy` for `runtime::Pipeline`.
In `AFF3CT-core`, thread pinning can be set in `runtime::Sequence` and
`runtime::Pipeline` class constructors. In both cases, there is a dedicated
argument of `std::string` type named `sequence_pinning_policy` for
`runtime::Sequence` or `pipeline_pinning_policy` for `runtime::Pipeline`.

!!! info
It is important to specify the thread pinning at the construction of the
`runtime::Sequence`/`runtime::Pipeline` object to guarantee that the data
will be allocated and initialized (first touch policy) on the right memory
banks during the replication process.
For NUMA architectures, it is important to specify thread pinning at the
construction of the `runtime::Sequence`/`runtime::Pipeline` object to
guarantee that the data will be allocated and initialized on the right
memory banks (according to the first touch policy) during the replication
process.

To specify the pinning policy, we defined a syntax to express `hwloc` with three
different separators:
To specify the pinning policy, we defined a syntax to express `hwloc` objects
with three different separators:

- Pipeline stage (does not concern `runtime::Sequence`): `|`
- Replicated stage (= replicated sequence = one thread): `;`
- For one thread, the list of pinned `hwloc` objects (= logical or): `,`

Then, the pinning can contains all the available `hwloc` objects. Below is
the correspondence between the `std::string` and the `hwloc` objects type
enumerate:
Then, the pinning policy can contain all the available `hwloc` objects. Below
is the correspondence between the `std::string` and the `hwloc` object types:

```cpp
static std::map<std::string, hwloc_obj_type_t> object_map =
{ /* global containers */ /* data caches */ /* instruction caches */
std::map<std::string, hwloc_obj_type_t> str_to_hwloc_obj =
{
/* global containers */ /* data caches */ /* instruction caches */
{ "GROUP", HWLOC_OBJ_GROUP }, { "L5D", HWLOC_OBJ_L5CACHE }, { "L3I", HWLOC_OBJ_L3ICACHE },
{ "NUMA", HWLOC_OBJ_NUMANODE }, { "L4D", HWLOC_OBJ_L4CACHE }, { "L2I", HWLOC_OBJ_L2ICACHE },
{ "PACKAGE", HWLOC_OBJ_PACKAGE }, { "L3D", HWLOC_OBJ_L3CACHE }, { "L1I", HWLOC_OBJ_L1ICACHE },
@@ -91,26 +92,24 @@ static std::map<std::string, hwloc_obj_type_t> object_map =
};
```

The following syntax is used to specify the object index `X`: `OBJECT_X`.

`OBJECT` can be all the `std::string` defined in the previous listing
(ex: `PU_10` refers to the logical process unit n°10).
To specify the index `X` of an `hwloc` object, the following syntax is used:
`OBJECT_X` (ex: `PU_5` refers to the logical PU n°5).
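
The `OBJECT_X` form can be illustrated with a minimal C++ sketch that separates the object name from its logical index. The function name `split_object_token` is hypothetical and is not part of `AFF3CT-core`. The sketch assumes that the index is the decimal number following the last underscore, and it does not check that the name is a key of the map listed above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of AFF3CT-core): splits "PU_5" into { "PU", 5 }.
std::pair<std::string, unsigned long> split_object_token(const std::string& token)
{
    // Assumption: the index is whatever follows the last underscore.
    const std::size_t sep = token.rfind('_');
    if (sep == std::string::npos || sep == 0 || sep + 1 == token.size())
        throw std::invalid_argument("expected OBJECT_X, got '" + token + "'");

    const std::string index = token.substr(sep + 1);
    assert(!index.empty()); // guaranteed by the check above

    // Reject anything that is not a plain decimal number (e.g. "PU_x" or "PU_-1").
    for (const char c : index)
        if (!std::isdigit(static_cast<unsigned char>(c)))
            throw std::invalid_argument("invalid index in '" + token + "'");

    return { token.substr(0, sep), std::stoul(index) };
}
```

With this sketch, `split_object_token("PU_5")` yields the name `PU` and the index `5`. A real implementation would then look the name up in the string-to-`hwloc` map before querying the topology.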

!!! info
`CORE` and `PU` objects can be confusing. If the CPU cores does not support
`CORE` and `PU` objects can be confusing. If the CPU cores do not support
SMT, then `CORE` and `PU` are the same. However, if the CPU cores support
SMT, then the `PU` is the hardware thread identifier inside a given `CORE`.

### Illustrative Examples

The section proposes some examples to understand how the syntax works. Only the
simplest `hwloc` object is used: the `PU`. Let's suppose that we have a
octo-core CPU with 8 process units (`PU_0, PU_1, PU_2, PU_3, PU_4, PU_5, PU_6,
PU_7`), see the topology of the Orange Pi 5 Plus above).
This section gives some examples to understand how the syntax works. We
suppose that we have a CPU with 8 PUs with the same topology as the Orange
Pi 5 Plus SBC presented before.

#### Example 1

We want to describe a 3 stages pipeline with:
Let's suppose we want to set up a 3-stage pipeline with the following
characteristics:

- **Stage 1** - No replication (= 1 thread):
- Pinned to `PU_0`
@@ -136,15 +135,18 @@ S2T4(Stage 2, thread 4 - pin: PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```

The input parameters will be:
In the previous configuration, 6 threads will execute simultaneously (even if
the given architecture supports up to 8 executions in parallel).

To instantiate this `runtime::Pipeline`, here are the corresponding constructor
parameters:

- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
- Enabling pinning: `{ true, true, true }`
- Enabling pinning per stage: `{ true, true, true }`
- Pinning policy:
`"PU_0 | PU_4, PU_5; PU_4, PU_5; PU_6, PU_7; PU_6, PU_7 | PU_0, PU_1, PU_2, PU_3"`

The previous pinning policy syntax can be compressed a little bit. It is
possible to use the following equivalent `std::string`:
The previous pinning policy syntax can be compressed a little bit as follows:

- Pinning policy :
`"PU_0 | PACKAGE_1; PACKAGE_1; PACKAGE_2; PACKAGE_2 | PACKAGE_0"`
@@ -153,7 +155,7 @@ possible to use the following equivalent `std::string`:

Let's now consider that we want to pin all the threads of the stage 2 on the
`PU_4`, `PU_5`, `PU_6` or `PU_7` (this is less restrictive than the previous
example). The pinning strategy for stage 1 and 3 is the same as before.
example). The pinning strategy for stage 1 and 3 is unchanged.

```mermaid
graph LR;
@@ -169,19 +171,23 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```

Here are the corresponding parameters:

- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
- Enabling pinning per stage: `{ true, true, true }`
- Pinning policy : `"PU_0 | PACKAGE_1, PACKAGE_2 | PACKAGE_0"`

With the previous syntax, the 4 threads of the stage 2 will apply the
`PACKAGE_1, PACKAGE_2` policy.
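
To make the role of the three separators concrete, here is a minimal C++ sketch that splits a policy string into stages (`|`), threads (`;`) and objects (`,`), and repeats a single thread entry for every replica of its stage. All the names (`ThreadPin`, `StagePin`, `expand_policy`) are hypothetical: this is not the `AFF3CT-core` parser, only an illustration of the behavior described in this document. It also assumes that an empty segment simply means that no pinning is given for that stage.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical types, not taken from AFF3CT-core.
using ThreadPin = std::vector<std::string>; // objects separated by ',' (logical or)
using StagePin  = std::vector<ThreadPin>;   // one entry per thread, separated by ';'

// Removes the leading and trailing spaces.
static std::string trim(const std::string& s)
{
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Splits `s` on `sep` and trims each piece (empty pieces are kept).
static std::vector<std::string> split(const std::string& s, const char sep)
{
    std::vector<std::string> pieces;
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t end = s.find(sep, begin);
        if (end == std::string::npos)
        {
            pieces.push_back(trim(s.substr(begin)));
            return pieces;
        }
        pieces.push_back(trim(s.substr(begin, end - begin)));
        begin = end + 1;
    }
}

// Parses `policy`; `n_threads[s]` is the number of replications of stage `s`.
std::vector<StagePin> expand_policy(const std::string& policy,
                                    const std::vector<std::size_t>& n_threads)
{
    const std::vector<std::string> stages = split(policy, '|');
    if (stages.size() != n_threads.size())
        throw std::invalid_argument("one policy segment is expected per stage");

    std::vector<StagePin> result;
    for (std::size_t s = 0; s < stages.size(); s++)
    {
        StagePin stage; // stays empty when the segment is empty (assumed: no pinning)
        if (!stages[s].empty())
            for (const std::string& thread : split(stages[s], ';'))
                stage.push_back(split(thread, ','));

        if (stage.size() == 1)
        {
            // A single entry applies to all the threads of the stage.
            const ThreadPin shared = stage.front();
            stage.assign(n_threads[s], shared);
        }
        else if (!stage.empty() && stage.size() != n_threads[s])
            throw std::invalid_argument("unexpected number of thread entries");

        assert(stage.empty() || stage.size() == n_threads[s]);
        result.push_back(stage);
    }
    return result;
}
```

Calling `expand_policy("PU_0 | PACKAGE_1, PACKAGE_2 | PACKAGE_0", { 1, 4, 1 })` gives four identical entries for stage 2, each holding `PACKAGE_1` and `PACKAGE_2`. The explicit form of Example 1, with one `;`-separated entry per thread, is returned unchanged.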

#### Example 3

It is also possible to choose the stages we want to pin using a vector of
`boolean`. For instance, if we don't want to pin the first stage, we can do:
It is also possible to choose the stages we want to pin or not using a vector of
`boolean`. Let's suppose we do not want to specify any pinning for the stage 1.

```mermaid
graph LR;
S1T1(Stage 1, thread 1 - no pin)-->SYNC1;
S1T1(Stage 1, thread 1 - no pinning)-->SYNC1;
SYNC1(Sync)-->S2T1;
SYNC1(Sync)-->S2T2;
SYNC1(Sync)-->S2T3;
@@ -193,11 +199,13 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```

- Enabling pinning: `{false, true, true}`
Here are the corresponding parameters:

- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
- Enabling pinning per stage: `{false, true, true}`
- Pinning policy: `"| PACKAGE_1, PACKAGE_2 | PACKAGE_0"`

Thus, the operating system will be in charge of pinning the thread of the first
stage.
In this case, the OS will be in charge of pinning the thread of the first stage.

### Unpin
