Showing 38 changed files with 7,122 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,10 @@ | |
# Python cache files | ||
*.pyc | ||
|
||
# Documentation artifacts | ||
/_build | ||
_toc.yml | ||
|
||
/build* | ||
/.vscode | ||
/.cache | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Read the Docs configuration file | ||
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details | ||
|
||
version: 2 | ||
|
||
build: | ||
os: ubuntu-22.04 | ||
tools: | ||
python: "3.10" | ||
|
||
python: | ||
install: | ||
- requirements: docs/sphinx/requirements.txt | ||
|
||
sphinx: | ||
configuration: docs/conf.py | ||
|
||
formats: [] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
_build/ | ||
_doxygen/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
.. meta:: | ||
:description: Omnitrace documentation and reference | ||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD | ||
|
||
********************** | ||
Data collection modes | ||
********************** | ||
|
||
Omnitrace supports several modes of recording trace and profiling data for your application. | ||
|
||
.. note:: | ||
|
||
For an explanation of the terms used in this topic, see | ||
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`. | ||
|
||
+-----------------------------+---------------------------------------------------------+ | ||
| Mode | Description | | ||
+=============================+=========================================================+ | ||
| Binary Instrumentation | Locates functions (and loops, if desired) in the binary | | ||
| | and inserts snippets at the entry and exit | | ||
+-----------------------------+---------------------------------------------------------+ | ||
| Statistical Sampling | Periodically pauses application at specified intervals | | ||
| | and records various metrics for the given call stack | | ||
+-----------------------------+---------------------------------------------------------+ | ||
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos | | ||
| | make callbacks into Omnitrace to provide information | | ||
| | about the work the API is performing | | ||
+-----------------------------+---------------------------------------------------------+ | ||
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent | | ||
| | dynamic library/executable, like ``pthread_mutex_lock`` | | ||
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library | | ||
+-----------------------------+---------------------------------------------------------+ | ||
| User API | User-defined regions and controls for Omnitrace | | ||
+-----------------------------+---------------------------------------------------------+ | ||
|
||
The two most generic and important modes are binary instrumentation and statistical sampling. | ||
It is important to understand their advantages and disadvantages. | ||
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument`` | ||
executable. For statistical sampling, it's highly recommended to use the | ||
``omnitrace-sample`` executable instead if binary instrumentation isn't required. | ||
Callback APIs and dynamic symbol interception can be used with either tool. | ||
|
||
Binary instrumentation | ||
----------------------------------- | ||
|
||
Binary instrumentation lets you record deterministic measurements for | ||
every single invocation of a given function. | ||
Binary instrumentation effectively adds instructions to the target application to | ||
collect the required information. It therefore has the potential to cause performance | ||
changes which might, in some cases, lead to inaccurate results. The effect depends on | ||
the information being collected and which features are activated in Omnitrace. | ||
For example, collecting only the wall-clock timing data | ||
has less of an effect than collecting the wall-clock timing, CPU-clock timing, | ||
memory usage, cache-misses, and number of instructions that were run. Similarly, | ||
collecting a flat profile has less overhead than a hierarchical profile | ||
and collecting a trace OR a profile has less overhead than collecting a | ||
trace AND a profile. | ||
|
||
In Omnitrace, the primary heuristic for controlling the overhead with binary | ||
instrumentation is the minimum number of instructions for selecting functions | ||
for instrumentation. | ||
|
||
Statistical sampling | ||
----------------------------------- | ||
|
||
Statistical call-stack sampling interrupts the application at | ||
regular intervals using operating system interrupts. | ||
Sampling is typically less numerically accurate and specific, but the | ||
target program runs at nearly full speed. | ||
In contrast to the data derived from binary instrumentation, the resulting | ||
data is not exact but is instead a statistical approximation. | ||
However, sampling often provides a more accurate picture of the application | ||
execution because it is less intrusive to the target application and has fewer | ||
side effects on memory caches or instruction decoding pipelines. Furthermore, | ||
because sampling does not affect the execution speed as much, it is | ||
relatively immune to over-evaluating the cost of small, frequently called | ||
functions or "tight" loops. | ||
|
||
In Omnitrace, the overhead for statistical sampling depends on the | ||
sampling rate and whether the samples are taken with respect to the CPU time | ||
and/or real time. | ||
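
The sampling mechanism described above can be illustrated with a minimal sketch. It is not Omnitrace's implementation: it only assumes a POSIX system and uses the standard ``setitimer`` and ``sigaction`` calls to deliver a periodic signal, counting samples instead of recording call stacks. The names ``sample_until``, ``on_sample``, and ``g_samples`` are hypothetical. Passing ``ITIMER_PROF`` with ``SIGPROF`` samples with respect to CPU time, while ``ITIMER_REAL`` with ``SIGALRM`` samples with respect to real time.

```cpp
#include <atomic>
#include <signal.h>
#include <sys/time.h>

namespace
{
// Incremented from the signal handler. A lock-free atomic is safe to
// modify inside a handler (assumed lock-free for long on this platform).
std::atomic<long> g_samples{ 0 };

// Hypothetical handler: a real profiler would record the call stack here.
void on_sample(int) { g_samples.fetch_add(1, std::memory_order_relaxed); }

long fib(long n) { return (n < 2) ? n : fib(n - 1) + fib(n - 2); }
}  // namespace

// Keeps the CPU busy until `target` samples have been taken, then returns
// the number of samples observed (or -1 if the timer could not be set up).
//   which = ITIMER_PROF, signum = SIGPROF -> interval measured in CPU time
//   which = ITIMER_REAL, signum = SIGALRM -> interval measured in real time
// interval_usec must be less than 1,000,000 because it is stored in tv_usec.
long sample_until(long target, int which, int signum, long interval_usec)
{
    g_samples.store(0);

    struct sigaction sa = {};
    sa.sa_handler       = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if(sigaction(signum, &sa, nullptr) != 0) return -1;

    itimerval timer           = {};
    timer.it_interval.tv_usec = interval_usec;  // period between samples
    timer.it_value.tv_usec    = interval_usec;  // delay before the first sample
    if(setitimer(which, &timer, nullptr) != 0) return -1;

    volatile long sink = 0;
    while(g_samples.load() < target)
        sink = sink + fib(20);  // the "application" being sampled

    // Disarm the timer. The handler stays installed so that a signal
    // already in flight is still handled instead of ending the process.
    itimerval stop = {};
    setitimer(which, &stop, nullptr);
    return g_samples.load();
}
```

For example, ``sample_until(5, ITIMER_PROF, SIGPROF, 1000)`` requests one sample per millisecond of CPU time. A shorter interval yields more samples per second and therefore more interruptions, which is why the overhead grows with the sampling rate.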
|
||
Binary instrumentation vs. statistical sampling example | ||
------------------------------------------------------- | ||
|
||
Consider the following code: | ||
|
||
.. code-block:: c++ | ||
|
||
   #include <cstdio> | ||
   #include <cstdlib> | ||
|
||
   // Naive recursive Fibonacci: a very small, very frequently called function | ||
   long fib(long n) | ||
   { | ||
       if(n < 2) return n; | ||
       return fib(n - 1) + fib(n - 2); | ||
   } | ||
|
||
   // i is the iteration index, n is the Fibonacci number to compute | ||
   void run(long i, long n) | ||
   { | ||
       long result = fib(n); | ||
       printf("[%li] fibonacci(%li) = %li\n", i, n, result); | ||
   } | ||
|
||
   int main(int argc, char** argv) | ||
   { | ||
       long nfib = 30; | ||
       long nitr = 10; | ||
       if(argc > 1) nfib = atol(argv[1]); | ||
       if(argc > 2) nitr = atol(argv[2]); | ||
|
||
       for(long i = 0; i < nitr; ++i) | ||
           run(i, nfib); | ||
|
||
       return 0; | ||
   } | ||
|
||
Binary instrumentation of the ``fib`` function will record **every single invocation** | ||
of the function. For a very small function | ||
such as ``fib``, this results in **significant** overhead since this simple function | ||
takes about 20 instructions, whereas the entry and | ||
exit snippets are ~1024 instructions. Therefore, you generally want to avoid | ||
instrumenting functions where the instrumented function has significantly fewer | ||
instructions than the entry and exit instrumentation. (Note that many of the | ||
instructions in the entry and exit snippets are either logging functions or | ||
depend on the runtime settings and thus might never run.) Nevertheless, | ||
due to the number of potential instructions in the entry and exit snippets, | ||
the default behavior of ``omnitrace-instrument`` is to only instrument functions | ||
which contain at least 1024 instructions. | ||
|
||
However, recording every single invocation of the function can be extremely | ||
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger | ||
than the average or a high standard deviation. In this case, the traces help you | ||
identify exactly when and where those instances deviated from the norm. | ||
Compare the level of detail in the following traces. In the top image, | ||
every instance of the ``fib`` function is instrumented, while in the bottom image, | ||
the ``fib`` call-stack is derived via sampling. | ||
|
||
Binary instrumentation of the Fibonacci function | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
.. image:: ../data/fibonacci-instrumented.png | ||
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function | ||
|
||
Statistical sampling of the Fibonacci function | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
.. image:: ../data/fibonacci-sampling.png | ||
:alt: Visualization of the output of a statistical sample of the Fibonacci function |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,137 @@ | ||
.. meta:: | ||
:description: Omnitrace documentation and reference | ||
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD | ||
|
||
*************************************** | ||
The Omnitrace feature set and use cases | ||
*************************************** | ||
|
||
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible. | ||
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_ | ||
to manage extensions, resources, data, and other items. It supports the following features, | ||
modes, metrics, and APIs. | ||
|
||
Data collection modes | ||
======================================== | ||
|
||
* Dynamic instrumentation | ||
|
||
* Runtime instrumentation: Instrument executables and shared libraries at runtime | ||
* Binary rewriting: Generate a new executable and/or library with instrumentation built-in | ||
|
||
* Statistical sampling: Periodic software interrupts per-thread | ||
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs | ||
* Causal profiling: Quantifies the potential impact of optimizations in parallel code | ||
|
||
.. note:: | ||
|
||
Critical trace support was removed in Omnitrace v1.11.0. | ||
It was replaced by the causal profiling feature. | ||
|
||
Data analysis | ||
======================================== | ||
|
||
* High-level summary profiles with mean, min, max, and standard deviation statistics | ||
|
||
* Low overhead and memory efficient | ||
* Ideal for running at scale | ||
|
||
* Comprehensive traces for every individual event and measurement | ||
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling | ||
|
||
Parallelism API support | ||
======================================== | ||
|
||
* HIP | ||
* HSA | ||
* Pthreads | ||
* MPI | ||
* Kokkos-Tools (KokkosP) | ||
* OpenMP-Tools (OMPT) | ||
|
||
GPU metrics | ||
======================================== | ||
|
||
* GPU hardware counters | ||
* HIP API tracing | ||
* HIP kernel tracing | ||
* HSA API tracing | ||
* HSA operation tracing | ||
* System-level sampling (via rocm-smi) | ||
|
||
* Memory usage | ||
* Power usage | ||
* Temperature | ||
* Utilization | ||
|
||
CPU metrics | ||
======================================== | ||
|
||
* CPU hardware counters sampling and profiles | ||
* CPU frequency sampling | ||
* Various timing metrics | ||
|
||
* Wall time | ||
* CPU time (process and thread) | ||
* CPU utilization (process and thread) | ||
* User CPU time | ||
* Kernel CPU time | ||
|
||
* Various memory metrics | ||
|
||
* High-water mark (sampling and profiles) | ||
* Memory page allocation | ||
* Virtual memory usage | ||
|
||
* Network statistics | ||
* I/O metrics | ||
* Many others | ||
|
||
Third-party API support | ||
======================================== | ||
|
||
* TAU | ||
* LIKWID | ||
* Caliper | ||
* CrayPAT | ||
* VTune | ||
* NVTX | ||
* ROCTX | ||
|
||
Omnitrace use cases | ||
======================================== | ||
|
||
When analyzing the performance of an application, do NOT | ||
assume you know where the performance bottlenecks are | ||
and why they are happening. Omnitrace is a tool for analyzing the entire | ||
application and its performance. It is | ||
ideal for characterizing where optimization would have the greatest impact | ||
on an end-to-end run of the application and for | ||
viewing what else is happening on the system during a performance bottleneck. | ||
|
||
When GPUs are involved, there is a tendency to assume that | ||
the quickest path to performance improvement is minimizing | ||
the runtime of the GPU kernels. This is a highly flawed assumption. | ||
If you optimize the runtime of a kernel from 1 millisecond | ||
to 1 microsecond (a 1000x speed-up) but the original application never | ||
spent time waiting for kernels to complete, | ||
there would be no statistically significant reduction in the end-to-end | ||
runtime of your application. In other words, it does not matter | ||
how fast or slow the code on the GPU is if the application is not | ||
bottlenecked waiting on the GPU. | ||
|
||
Use Omnitrace to obtain a high-level view of the entire application. Use it | ||
to determine where the performance bottlenecks are and | ||
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel | ||
performance, start your investigation with Omnitrace, which characterizes the | ||
broad picture. | ||
|
||
.. note:: | ||
|
||
For insight into the execution of individual kernels on the GPU, | ||
use `Omniperf <https://github.com/rocm/omniperf>`_. | ||
|
||
In terms of CPU analysis, Omnitrace does not target any specific vendor. | ||
It works just as well on AMD and non-AMD CPUs. | ||
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs | ||
and kernels running on AMD GPUs. |