
Commit

Merge branch 'ROCm:main' into main
ajanicijamd authored Jul 30, 2024
2 parents bc1fad2 + db79905 commit 659104c
Showing 38 changed files with 7,122 additions and 9 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -4,3 +4,4 @@
docs/* @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
11 changes: 11 additions & 0 deletions .github/dependabot.yml
@@ -9,3 +9,14 @@ updates:
directory: "/" # Location of package manifests
schedule:
interval: "weekly"

- package-ecosystem: "pip" # See documentation for possible values
directory: "/docs/sphinx" # Location of package manifests
open-pull-requests-limit: 10
schedule:
interval: "daily"
labels:
- "documentation"
- "dependencies"
reviewers:
- "samjwu"
4 changes: 4 additions & 0 deletions .gitignore
@@ -37,6 +37,10 @@
# Python cache files
*.pyc

# Documentation artifacts
/_build
_toc.yml

/build*
/.vscode
/.cache
18 changes: 18 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,18 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

version: 2

build:
os: ubuntu-22.04
tools:
python: "3.10"

python:
install:
- requirements: docs/sphinx/requirements.txt

sphinx:
configuration: docs/conf.py

formats: []
16 changes: 7 additions & 9 deletions README.md
@@ -7,8 +7,6 @@
[![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
[![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)

> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
## Overview

AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -86,8 +84,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me

## Documentation

The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.

## Quick Start

@@ -108,7 +106,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
```

See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.

### Setup

@@ -297,13 +295,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
- Select "Open trace file" from panel on the left
- Locate the omnitrace perfetto output (extension: `.proto`)

![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)

![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
![omnitrace-rocm](docs/data/omnitrace-rocm.png)

![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)

![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
![omnitrace-user-api](docs/data/omnitrace-user-api.png)

## Using Perfetto tracing with System Backend

2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,2 @@
_build/
_doxygen/
146 changes: 146 additions & 0 deletions docs/conceptual/data-collection-modes.rst
@@ -0,0 +1,146 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

**********************
Data collection modes
**********************

Omnitrace supports several modes of recording trace and profiling data for your application.

.. note::

For an explanation of the terms used in this topic, see
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.

+-----------------------------+---------------------------------------------------------+
| Mode | Description |
+=============================+=========================================================+
| Binary Instrumentation | Locates functions (and loops, if desired) in the binary |
| | and inserts snippets at the entry and exit |
+-----------------------------+---------------------------------------------------------+
| Statistical Sampling | Periodically pauses application at specified intervals |
| | and records various metrics for the given call stack |
+-----------------------------+---------------------------------------------------------+
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
| | make callbacks into Omnitrace to provide information |
| | about the work the API is performing |
+-----------------------------+---------------------------------------------------------+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
| | dynamic library/executable, like ``pthread_mutex_lock`` |
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
| User API | User-defined regions and controls for Omnitrace |
+-----------------------------+---------------------------------------------------------+

The two most generic and important modes are binary instrumentation and statistical sampling,
and it is important to understand their advantages and disadvantages.
Both can be performed with the ``omnitrace-instrument``
executable, but when binary instrumentation isn't required, it's highly recommended
to use the dedicated ``omnitrace-sample`` executable for statistical sampling instead.
Callback APIs and dynamic symbol interception can be used with either tool.
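
The following sketch shows how the two executables are typically invoked. ``./myapp``
is a placeholder application, and the exact options can vary between Omnitrace versions;
see the installation and quick-start documentation for the authoritative workflow.

.. code-block:: shell

   # Binary instrumentation: write an instrumented copy of the binary, then run it
   # (some Omnitrace versions launch the instrumented binary through omnitrace-run)
   omnitrace-instrument -o myapp.inst -- ./myapp
   ./myapp.inst

   # Statistical sampling: no instrumentation step required
   omnitrace-sample -- ./myapp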

Binary instrumentation
-----------------------------------

Binary instrumentation lets you record deterministic measurements for
every single invocation of a given function.
Binary instrumentation effectively adds instructions to the target application to
collect the required information. It therefore has the potential to cause performance
changes which might, in some cases, lead to inaccurate results. The effect depends on
the information being collected and which features are activated in Omnitrace.
For example, collecting only the wall-clock timing data
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
memory usage, cache-misses, and number of instructions that were run. Similarly,
collecting a flat profile has less overhead than a hierarchical profile
and collecting a trace OR a profile has less overhead than collecting a
trace AND a profile.

In Omnitrace, the primary heuristic for controlling the overhead with binary
instrumentation is the minimum number of instructions for selecting functions
for instrumentation.
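
As a sketch, this threshold might be raised when instrumenting a large application.
The ``--min-instructions`` option name is an assumption here; confirm it with
``omnitrace-instrument --help`` for your installation.

.. code-block:: shell

   # Only instrument functions containing at least 2048 instructions
   omnitrace-instrument --min-instructions 2048 -o myapp.inst -- ./myapp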

Statistical sampling
-----------------------------------

Statistical call-stack sampling periodically interrupts the application at
regular intervals using operating system interrupts.
Sampling is typically less numerically accurate and specific, but the
target program runs at nearly full speed.
In contrast to the data derived from binary instrumentation, the resulting
data is not exact but is instead a statistical approximation.
However, sampling often provides a more accurate picture of the application
execution because it is less intrusive to the target application and has fewer
side effects on memory caches or instruction decoding pipelines. Furthermore,
because sampling does not affect the execution speed as much, it is
relatively immune to overestimating the cost of small, frequently called
functions or "tight" loops.

In Omnitrace, the overhead for statistical sampling depends on the
sampling rate and whether the samples are taken with respect to the CPU time
and/or real time.
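
Both of these are controlled through the Omnitrace configuration. The following is a
sketch using environment variables; the setting names follow Omnitrace's ``OMNITRACE_*``
convention but should be verified (for example, with ``omnitrace-avail --settings``)
against the installed version:

.. code-block:: shell

   export OMNITRACE_SAMPLING_FREQ=50        # samples per second, per timer
   export OMNITRACE_SAMPLING_CPUTIME=ON     # sample with respect to CPU time
   export OMNITRACE_SAMPLING_REALTIME=ON    # also sample with respect to real (wall-clock) time
   omnitrace-sample -- ./myapp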

Binary instrumentation vs. statistical sampling example
-------------------------------------------------------

Consider the following code:

.. code-block:: c++

#include <cstdio>   // printf
#include <cstdlib>  // atol

long fib(long n)
{
if(n < 2) return n;
return fib(n - 1) + fib(n - 2);
}

void run(long n)
{
long result = fib(n);
printf("[%li] fibonacci(%li) = %li\n", i, n, result);
}

int main(int argc, char** argv)
{
long nfib = 30;
long nitr = 10;
if(argc > 1) nfib = atol(argv[1]);
if(argc > 2) nitr = atol(argv[2]);

for(long i = 0; i < nitr; ++i)
run(nfib);

return 0;
}
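
Before comparing the two modes on this example, here is a sketch of how it might be
built and run under each of them. The file name, compiler invocation, and option
names are assumptions for illustration:

.. code-block:: shell

   g++ -O2 -g fib.cpp -o fib

   # Statistical sampling of the example
   omnitrace-sample -- ./fib 30 10

   # Binary rewrite; the instruction threshold must be lowered to capture a
   # function as small as fib (option name assumed, as noted earlier)
   omnitrace-instrument --min-instructions 4 -o fib.inst -- ./fib
   ./fib.inst 30 10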

Binary instrumentation of the ``fib`` function will record **every single invocation**
of the function. For a very small function
such as ``fib``, this results in **significant** overhead since this simple function
takes about 20 instructions, whereas the entry and
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
instrumenting functions that contain significantly fewer
instructions than the entry and exit instrumentation itself. (Note that many of the
instructions in the entry and exit snippets are either logging instructions or
depend on the runtime settings and thus might never run.) For this reason,
given the number of potential instructions in the entry and exit snippets,
the default behavior of ``omnitrace-instrument`` is to only instrument functions
which contain at least 1024 instructions.

However, recording every single invocation of the function can be extremely
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
than the average or a high standard deviation. In this case, the traces help you
identify exactly when and where those instances deviated from the norm.
Compare the level of detail in the following traces. In the top image,
every instance of the ``fib`` function is instrumented, while in the bottom image,
the ``fib`` call-stack is derived via sampling.

Binary instrumentation of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../data/fibonacci-instrumented.png
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function

Statistical sampling of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../data/fibonacci-sampling.png
:alt: Visualization of the output of a statistical sample of the Fibonacci function
137 changes: 137 additions & 0 deletions docs/conceptual/omnitrace-feature-set.rst
@@ -0,0 +1,137 @@
.. meta::
:description: Omnitrace documentation and reference
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

***************************************
The Omnitrace feature set and use cases
***************************************

`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
modes, metrics, and APIs.

Data collection modes
========================================

* Dynamic instrumentation

* Runtime instrumentation: Instrument executables and shared libraries at runtime
* Binary rewriting: Generate a new executable and/or library with instrumentation built-in

* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code

.. note::

Critical trace support was removed in Omnitrace v1.11.0.
It was replaced by the causal profiling feature.
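
As a sketch, causal profiling has its own launcher, ``omnitrace-causal``. Causal
profiling works by running virtual speed-up experiments, and its predictions generally
improve with repeated runs; the invocation below and its exact behavior are assumptions,
so consult the causal profiling documentation for the supported options.

.. code-block:: shell

   # Run a causal-profiling experiment on the application (placeholder name)
   omnitrace-causal -- ./myapp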

Data analysis
========================================

* High-level summary profiles with mean, min, max, and standard deviation statistics

* Low overhead and memory efficient
* Ideal for running at scale

* Comprehensive traces for every individual event and measurement
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling

Parallelism API support
========================================

* HIP
* HSA
* Pthreads
* MPI
* Kokkos-Tools (KokkosP)
* OpenMP-Tools (OMPT)

GPU metrics
========================================

* GPU hardware counters
* HIP API tracing
* HIP kernel tracing
* HSA API tracing
* HSA operation tracing
* System-level sampling (via rocm-smi)

* Memory usage
* Power usage
* Temperature
* Utilization

CPU metrics
========================================

* CPU hardware counters sampling and profiles
* CPU frequency sampling
* Various timing metrics

* Wall time
* CPU time (process and thread)
* CPU utilization (process and thread)
* User CPU time
* Kernel CPU time

* Various memory metrics

* High-water mark (sampling and profiles)
* Memory page allocation
* Virtual memory usage

* Network statistics
* I/O metrics
* Many others (see the discovery sketch below)
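
The set of components, settings, and hardware counters available on a given system can
be enumerated with the ``omnitrace-avail`` tool. A minimal sketch follows; the flag
names are assumptions, so check ``omnitrace-avail --help`` on your installation.

.. code-block:: shell

   # List every component, setting, and hardware counter known to this build
   omnitrace-avail --all

   # Narrower queries
   omnitrace-avail --components --available
   omnitrace-avail --hw-counters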

Third-party API support
========================================

* TAU
* LIKWID
* Caliper
* CrayPAT
* VTune
* NVTX
* ROCTX

Omnitrace use cases
========================================

When analyzing the performance of an application, do NOT
assume you know where the performance bottlenecks are
and why they are happening. Omnitrace is a tool for analyzing the entire
application and its performance. It is
ideal for characterizing where optimization would have the greatest impact
on an end-to-end run of the application and for
viewing what else is happening on the system during a performance bottleneck.

When GPUs are involved, there is a tendency to assume that
the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption.
If you optimize the runtime of a kernel from one millisecond
to one microsecond (a 1000x speed-up) but the original application never
spent time waiting for kernels to complete,
there would be no statistically significant reduction in the end-to-end
runtime of your application. In other words, it does not matter
how fast or slow the code on the GPU is if the application is never
actually bottlenecked waiting on the GPU.

Use Omnitrace to obtain a high-level view of the entire application. Use it
to determine where the performance bottlenecks are and
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
performance, start your investigation with Omnitrace, which characterizes the
broad picture.

.. note::

For insight into the execution of individual kernels on the GPU,
use `Omniperf <https://github.com/rocm/omniperf>`_.

In terms of CPU analysis, Omnitrace does not target any specific vendor.
It works just as well on AMD and non-AMD CPUs.
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
and kernels running on AMD GPUs.