Merge pull request #185 from awslabs/sjg/gpu-docs-dev
Documentation for GPU support
sebastiangrimberg authored Mar 4, 2024
2 parents 71b3813 + a79d79e commit a16c3bf
Showing 19 changed files with 80 additions and 2 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -21,6 +21,12 @@ The format of this changelog is based on
- Added documentation for various timer categories and improved timing breakdown of
various sections of a simulation.
- Fixed bug in implementation of numeric wave ports for driven simulations.
- Added GPU support for *Palace* via its dependencies, and added the
`config["Solver"]["Device"]` and `config["Solver"]["Backend"]` options for runtime
configuration of the MFEM device (`"CPU"` or `"GPU"`) and libCEED backend, with suitable
defaults for users.
- Added a new section to the documentation on
[Parallelism and GPU support](https://awslabs.github.io/palace/dev/guide/parallelism/).

## [0.12.0] - 2023-12-21

4 changes: 4 additions & 0 deletions README.md
@@ -39,6 +39,9 @@ the frequency or time domain, using the
[high-order operator partial assembly](https://mfem.org/performance/), parallel sparse
direct solvers, and algebraic multigrid (AMG) preconditioners, for fast performance on
platforms ranging from laptops to HPC systems.
- Support for hardware acceleration using NVIDIA or AMD GPUs, including multi-GPU
parallelism, using pure CUDA and HIP code as well as [MAGMA](https://icl.utk.edu/magma/)
and other libraries.

## Getting started

@@ -62,6 +65,7 @@ System requirements:
- C and Fortran (optional) compilers for dependency builds
- MPI distribution
- BLAS, LAPACK libraries
- CUDA Toolkit or ROCm installation (optional, for GPU support only)

## Documentation

3 changes: 2 additions & 1 deletion docs/make.jl
@@ -23,7 +23,8 @@ makedocs(
"guide/problem.md",
"guide/model.md",
"guide/boundaries.md",
"guide/postprocessing.md"
"guide/postprocessing.md",
"guide/parallelism.md"
],
"Configuration File" => Any[
"config/config.md",
1 change: 1 addition & 0 deletions docs/src/guide/guide.md
@@ -14,3 +14,4 @@ which can be performed with *Palace* and the various features available in the s
- [Simulation Models](model.md)
- [Boundary Conditions](boundaries.md)
- [Postprocessing and Visualization](postprocessing.md)
- [Parallelism and GPU Support](parallelism.md)
42 changes: 42 additions & 0 deletions docs/src/guide/parallelism.md
@@ -0,0 +1,42 @@
```@raw html
<!--- Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. --->
<!--- SPDX-License-Identifier: Apache-2.0 --->
```

# Parallelism and GPU Support

*Palace* employs multiple types of parallelism to maximize performance across a wide range
of deployments. The first is MPI-based distributed-memory parallelism, controlled using the
`-np` command-line flag as outlined in [Running *Palace*](../run.md).
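
For example, a minimal invocation on a single node might look like the following (the
configuration file name and process count are illustrative):

```bash
# Run Palace with 8 MPI processes
palace -np 8 config.json
```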

Shared-memory parallelism using OpenMP is also available. To enable this, the
`-DPALACE_WITH_OPENMP=ON` option should be specified at configure time. At runtime, the
number of threads is configured with the `-nt` argument to the `palace` executable, or by
setting the [`OMP_NUM_THREADS`](https://www.openmp.org/spec-html/5.0/openmpse50.html)
environment variable.
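
As a sketch, assuming a build configured with `-DPALACE_WITH_OPENMP=ON`, a hybrid
MPI-OpenMP run might look like:

```bash
# 4 MPI processes with 2 OpenMP threads each
palace -np 4 -nt 2 config.json

# Equivalently, take the thread count from the environment
OMP_NUM_THREADS=2 palace -np 4 config.json
```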

Lastly, *Palace* supports GPU acceleration using NVIDIA and AMD GPUs, activated with the
build options `-DPALACE_WITH_CUDA=ON` and `-DPALACE_WITH_HIP=ON`, respectively. At runtime,
the [`config["Solver"]["Device"]`](../config/solver.md#config%5B%22Solver%22%5D) parameter
in the configuration file can be set to `"CPU"` (the default) or `"GPU"` in order to
configure *Palace* and MFEM to use the available GPU(s). The
[`config["Solver"]["Backend"]`](../config/solver.md#config%5B%22Solver%22%5D) parameter, on
the other hand, controls the
[libCEED backend](https://libceed.org/en/latest/gettingstarted/#backends). Users typically
do not need to provide a value for this option and can instead rely on *Palace*'s default,
which selects the most appropriate backend for the given value of
[`config["Solver"]["Device"]`](../config/solver.md#config%5B%22Solver%22%5D).
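
As a minimal sketch, assuming default behavior elsewhere, the relevant portion of a
configuration file for a GPU run might look like this; the commented `"Backend"` line
illustrates the kind of libCEED resource string that could be passed explicitly, though
relying on the default is recommended:

```json
"Solver":
{
  "Order": 3,
  "Device": "GPU"
  // "Backend": "/gpu/cuda/magma" // illustrative libCEED resource string, normally omitted
}
```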

To take full advantage of GPU acceleration, it is recommended to use
[operator partial assembly](https://mfem.org/performance/), activated when the value of
[`config["Solver"]["PartialAssemblyOrder"]`](../config/solver.md#config%5B%22Solver%22%5D)
is less than [`config["Solver"]["Order"]`](../config/solver.md#config%5B%22Solver%22%5D).
This approach avoids assembling a global sparse matrix and instead uses operator data
structures with more efficient asymptotic storage and application costs. See
[https://libceed.org/en/latest/intro/](https://libceed.org/en/latest/intro/) for more
details. Partial assembly in *Palace* supports mixed meshes including both tensor product
elements (hexahedra and quadrilaterals) and non-tensor product elements (tetrahedra,
prisms, pyramids, and triangles).
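
Continuing the sketch above, a solver block that activates partial assembly for a GPU run
could look like the following (values are illustrative):

```json
"Solver":
{
  "Order": 3,
  "PartialAssemblyOrder": 1, // less than "Order" above, so partial assembly is used
  "Device": "GPU"
}
```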
2 changes: 1 addition & 1 deletion docs/src/guide/postprocessing.md
@@ -70,7 +70,7 @@ These include:
## Boundary postprocessing

Boundary postprocessing capabilities are enabled by including objects under
`config["Boundaries"]["Postprocessing"]`](../config/boundaries.md) in the configuration
[`config["Boundaries"]["Postprocessing"]`](../config/boundaries.md) in the configuration
file. These include:

- [`config["Boundaries"]["Postprocessing"]["Capacitance"]`](../config/boundaries.md#boundaries%5B%22Postprocessing%22%5D%5B%22Capacitance%22%5D) :
4 changes: 4 additions & 0 deletions docs/src/index.md
@@ -42,6 +42,10 @@ the frequency or time domain, using the
[high-order operator partial assembly](https://mfem.org/performance/), parallel sparse
direct solvers, and algebraic multigrid (AMG) preconditioners, for fast performance on
platforms ranging from laptops to HPC systems.
- Support for
[hardware acceleration using NVIDIA or AMD GPUs](https://libceed.org/en/latest/intro/),
including multi-GPU parallelism, using pure CUDA and HIP code as well as
[MAGMA](https://icl.utk.edu/magma/) and other libraries.

## Contents

9 changes: 9 additions & 0 deletions docs/src/install.md
@@ -56,6 +56,9 @@ A build from source requires the following prerequisites installed on your syste
- C and Fortran (optional) compilers for dependency builds
- MPI distribution
- BLAS, LAPACK libraries (described below in [Math libraries](#Math-libraries))
- [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) or
[ROCm](https://rocm.docs.amd.com/en/latest/) installation (optional, for GPU support
only)

In addition, builds from source require the following system packages which are typically
already installed and are available from most package managers (`apt`, `dnf`, `brew`, etc.):
@@ -101,6 +104,9 @@ The *Palace* build respects standard CMake variables, including:
desired compilers.
- `CMAKE_CXX_FLAGS`, `CMAKE_C_FLAGS`, and `CMAKE_Fortran_FLAGS` which define the
corresponding compiler flags.
- `CMAKE_CUDA_COMPILER`, `CMAKE_CUDA_FLAGS`, `CMAKE_CUDA_ARCHITECTURES`, and the
corresponding `CMAKE_HIP_COMPILER`, `CMAKE_HIP_FLAGS`, and `CMAKE_HIP_ARCHITECTURES` for
GPU-accelerated builds with CUDA or HIP.
- `CMAKE_INSTALL_PREFIX` which specifies the path for installation (if none is provided,
defaults to `<BUILD_DIR>`).
- `CMAKE_BUILD_TYPE` which defines the build type such as `Release`, `Debug`,
@@ -116,6 +122,9 @@ Additional build options are (with default values in brackets):

- `PALACE_WITH_64BIT_INT [OFF]` : Build with 64-bit integer support
- `PALACE_WITH_OPENMP [OFF]` : Use OpenMP for shared-memory parallelism
- `PALACE_WITH_CUDA [OFF]` : Use CUDA for NVIDIA GPU support
- `PALACE_WITH_HIP [OFF]` : Use HIP for AMD or NVIDIA GPU support
- `PALACE_WITH_GPU_AWARE_MPI [OFF]` : Use GPU-aware MPI when the installed MPI
  distribution supports it (see the example configure command after this list)
- `PALACE_WITH_SUPERLU [ON]` : Build with SuperLU_DIST sparse direct solver
- `PALACE_WITH_STRUMPACK [OFF]` : Build with STRUMPACK sparse direct solver
- `PALACE_WITH_MUMPS [OFF]` : Build with MUMPS sparse direct solver
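
As an example of how these options combine, a sketch of a configure step for an NVIDIA GPU
build might look like the following (the CUDA architecture value is illustrative and
system dependent):

```bash
# Configure and build Palace with CUDA and GPU-aware MPI enabled
mkdir build && cd build
cmake .. \
  -DPALACE_WITH_CUDA=ON \
  -DPALACE_WITH_GPU_AWARE_MPI=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80
make -j
```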
1 change: 1 addition & 0 deletions examples/cavity/cavity_impedance.json
@@ -49,6 +49,7 @@
"Solver":
{
"Order": 4,
"Device": "CPU",
"Eigenmode":
{
"N": 15,
1 change: 1 addition & 0 deletions examples/cavity/cavity_pec.json
@@ -46,6 +46,7 @@
"Solver":
{
"Order": 4,
"Device": "CPU",
"Eigenmode":
{
"N": 15,
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_matched.json
@@ -48,6 +48,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_open.json
@@ -46,6 +46,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_short.json
@@ -42,6 +42,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/cpw/cpw_lumped_adaptive.json
@@ -164,6 +164,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_lumped_uniform.json
@@ -164,6 +164,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_wave_adaptive.json
@@ -128,6 +128,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_wave_uniform.json
@@ -128,6 +128,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/rings/rings.json
@@ -78,6 +78,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Magnetostatic":
{
"Save": 2
1 change: 1 addition & 0 deletions examples/spheres/spheres.json
@@ -74,6 +74,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Electrostatic":
{
"Save": 2
