Documentation for GPU support #185

Merged 6 commits on Mar 4, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -21,6 +21,12 @@ The format of this changelog is based on
- Added documentation for various timer categories and improved timing breakdown of
various sections of a simulation.
- Fixed bug in implementation of numeric wave ports for driven simulations.
- Added GPU support for *Palace* via its dependencies, and added the
`config["Solver"]["Device"]` and `config["Solver"]["Backend"]` options for runtime
configuration of the MFEM device (`"CPU"` or `"GPU"`) and libCEED backend, with suitable
defaults for users.
- Added a new section to the documentation on
[Parallelism and GPU support](https://awslabs.github.io/palace/dev/guide/parallelism/).

## [0.12.0] - 2023-12-21

4 changes: 4 additions & 0 deletions README.md
@@ -39,6 +39,9 @@ the frequency or time domain, using the
[high-order operator partial assembly](https://mfem.org/performance/), parallel sparse
direct solvers, and algebraic multigrid (AMG) preconditioners, for fast performance on
platforms ranging from laptops to HPC systems.
- Support for hardware acceleration using NVIDIA or AMD GPUs, including multi-GPU
parallelism, using pure CUDA and HIP code as well as [MAGMA](https://icl.utk.edu/magma/)
and other libraries.

## Getting started

@@ -62,6 +65,7 @@ System requirements:
- C and Fortran (optional) compilers for dependency builds
- MPI distribution
- BLAS, LAPACK libraries
- CUDA Toolkit or ROCm installation (optional, for GPU support only)

## Documentation

3 changes: 2 additions & 1 deletion docs/make.jl
@@ -23,7 +23,8 @@ makedocs(
"guide/problem.md",
"guide/model.md",
"guide/boundaries.md",
"guide/postprocessing.md"
"guide/postprocessing.md",
"guide/parallelism.md"
],
"Configuration File" => Any[
"config/config.md",
1 change: 1 addition & 0 deletions docs/src/guide/guide.md
@@ -14,3 +14,4 @@ which can be performed with *Palace* and the various features available in the s
- [Simulation Models](model.md)
- [Boundary Conditions](boundaries.md)
- [Postprocessing and Visualization](postprocessing.md)
- [Parallelism and GPU Support](parallelism.md)
42 changes: 42 additions & 0 deletions docs/src/guide/parallelism.md
@@ -0,0 +1,42 @@
```@raw html
<!--- Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. --->
<!--- SPDX-License-Identifier: Apache-2.0 --->
```

# Parallelism and GPU Support

*Palace* employs multiple types of parallelism to maximize performance across a wide range
of deployment scenarios. The first is MPI-based distributed-memory parallelism, controlled
using the `-np` command line flag as outlined in [Running *Palace*](../run.md).

Shared-memory parallelism using OpenMP is also available. To enable this, the
`-DPALACE_WITH_OPENMP=ON` option should be specified at configure time. At runtime, the
number of threads is configured with the `-nt` argument to the `palace` executable, or by
setting the [`OMP_NUM_THREADS`](https://www.openmp.org/spec-html/5.0/openmpse50.html)
environment variable.

Lastly, *Palace* supports GPU acceleration on NVIDIA and AMD GPUs, activated with the build
options `-DPALACE_WITH_CUDA=ON` and `-DPALACE_WITH_HIP=ON`, respectively. At runtime,
the [`config["Solver"]["Device"]`](../config/solver.md#config%5B%22Solver%22%5D) parameter
in the configuration file can be set to `"CPU"` (the default) or `"GPU"` in order to
configure *Palace* and MFEM to use the available GPU(s). The
[`config["Solver"]["Backend"]`](../config/solver.md#config%5B%22Solver%22%5D) parameter, on
the other hand, controls the
[libCEED backend](https://libceed.org/en/latest/gettingstarted/#backends). Users typically
do not need to provide a value for this option and can instead rely on *Palace*'s default,
which selects the most appropriate backend for the given value of
[`config["Solver"]["Device"]`](../config/solver.md#config%5B%22Solver%22%5D).
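
For example, a minimal `config["Solver"]` snippet requesting GPU execution might look like
the following sketch (the explicit `"Backend"` value is for illustration only:
`"/gpu/cuda/magma"` is one possible libCEED backend string for a CUDA build, and omitting
the option entirely lets *Palace* select a suitable default):

```json
"Solver":
{
    "Device": "GPU", // run on the available GPU(s) instead of the default "CPU"
    "Backend": "/gpu/cuda/magma" // optional; omit to let Palace pick the backend
}
```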

In order to take full advantage of the performance benefits of GPU acceleration, it is
recommended to make use of
[operator partial assembly](https://mfem.org/performance/), activated when the value of
[`config["Solver"]["PartialAssemblyOrder"]`](../config/solver.md#config%5B%22Solver%22%5D)
is less than [`config["Solver"]["Order"]`](../config/solver.md#config%5B%22Solver%22%5D).
This feature avoids assembling a global sparse matrix and instead uses operator data
structures with more efficient asymptotic storage and application costs. See
[https://libceed.org/en/latest/intro/](https://libceed.org/en/latest/intro/) for more
details. Partial assembly in *Palace* supports mixed meshes, including both tensor product
elements (hexahedra and quadrilaterals) and non-tensor product elements (tetrahedra,
prisms, pyramids, and triangles).
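
As an illustration, the sketch below (the values are placeholders, not recommendations)
pairs a fourth-order discretization with partial assembly and GPU execution; because the
value of `"PartialAssemblyOrder"` is less than `"Order"`, operators are applied matrix-free
instead of being assembled into a global sparse matrix:

```json
"Solver":
{
    "Order": 4, // polynomial order of the finite element approximation
    "PartialAssemblyOrder": 1, // 1 < 4, so partial assembly is active
    "Device": "GPU"
}
```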
2 changes: 1 addition & 1 deletion docs/src/guide/postprocessing.md
@@ -70,7 +70,7 @@ These include:
## Boundary postprocessing

Boundary postprocessing capabilities are enabled by including objects under
`config["Boundaries"]["Postprocessing"]`](../config/boundaries.md) in the configuration
[`config["Boundaries"]["Postprocessing"]`](../config/boundaries.md) in the configuration
file. These include:

- [`config["Boundaries"]["Postprocessing"]["Capacitance"]`](../config/boundaries.md#boundaries%5B%22Postprocessing%22%5D%5B%22Capacitance%22%5D) :
4 changes: 4 additions & 0 deletions docs/src/index.md
@@ -42,6 +42,10 @@ the frequency or time domain, using the
[high-order operator partial assembly](https://mfem.org/performance/), parallel sparse
direct solvers, and algebraic multigrid (AMG) preconditioners, for fast performance on
platforms ranging from laptops to HPC systems.
- Support for
[hardware acceleration using NVIDIA or AMD GPUs](https://libceed.org/en/latest/intro/),
including multi-GPU parallelism, using pure CUDA and HIP code as well as
[MAGMA](https://icl.utk.edu/magma/) and other libraries.

## Contents

9 changes: 9 additions & 0 deletions docs/src/install.md
@@ -56,6 +56,9 @@ A build from source requires the following prerequisites installed on your syste
- C and Fortran (optional) compilers for dependency builds
- MPI distribution
- BLAS, LAPACK libraries (described below in [Math libraries](#Math-libraries))
- [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) or
[ROCm](https://rocm.docs.amd.com/en/latest/) installation (optional, for GPU support
only)

In addition, builds from source require the following system packages which are typically
already installed and are available from most package managers (`apt`, `dnf`, `brew`, etc.):
@@ -101,6 +104,9 @@ The *Palace* build respects standard CMake variables, including:
desired compilers.
- `CMAKE_CXX_FLAGS`, `CMAKE_C_FLAGS`, and `CMAKE_Fortran_FLAGS` which define the
corresponding compiler flags.
- `CMAKE_CUDA_COMPILER`, `CMAKE_CUDA_FLAGS`, `CMAKE_CUDA_ARCHITECTURES`, and the
corresponding `CMAKE_HIP_COMPILER`, `CMAKE_HIP_FLAGS`, and `CMAKE_HIP_ARCHITECTURES` for
GPU-accelerated builds with CUDA or HIP.
- `CMAKE_INSTALL_PREFIX` which specifies the path for installation (if none is provided,
defaults to `<BUILD_DIR>`).
- `CMAKE_BUILD_TYPE` which defines the build type such as `Release`, `Debug`,
@@ -116,6 +122,9 @@ Additional build options are (with default values in brackets):

- `PALACE_WITH_64BIT_INT [OFF]` : Build with 64-bit integer support
- `PALACE_WITH_OPENMP [OFF]` : Use OpenMP for shared-memory parallelism
- `PALACE_WITH_CUDA [OFF]` : Use CUDA for NVIDIA GPU support
- `PALACE_WITH_HIP [OFF]` : Use HIP for AMD or NVIDIA GPU support
- `PALACE_WITH_GPU_AWARE_MPI [OFF]` : Enable if the MPI distribution is GPU aware
- `PALACE_WITH_SUPERLU [ON]` : Build with SuperLU_DIST sparse direct solver
- `PALACE_WITH_STRUMPACK [OFF]` : Build with STRUMPACK sparse direct solver
- `PALACE_WITH_MUMPS [OFF]` : Build with MUMPS sparse direct solver
1 change: 1 addition & 0 deletions examples/cavity/cavity_impedance.json
@@ -49,6 +49,7 @@
"Solver":
{
"Order": 4,
"Device": "CPU",
"Eigenmode":
{
"N": 15,
1 change: 1 addition & 0 deletions examples/cavity/cavity_pec.json
@@ -46,6 +46,7 @@
"Solver":
{
"Order": 4,
"Device": "CPU",
"Eigenmode":
{
"N": 15,
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_matched.json
@@ -48,6 +48,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_open.json
@@ -46,6 +46,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/coaxial/coaxial_short.json
@@ -42,6 +42,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Transient":
{
"Type": "GeneralizedAlpha",
1 change: 1 addition & 0 deletions examples/cpw/cpw_lumped_adaptive.json
@@ -164,6 +164,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_lumped_uniform.json
@@ -164,6 +164,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_wave_adaptive.json
@@ -128,6 +128,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/cpw/cpw_wave_uniform.json
@@ -128,6 +128,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Driven":
{
"MinFreq": 2.0, // GHz
1 change: 1 addition & 0 deletions examples/rings/rings.json
@@ -78,6 +78,7 @@
"Solver":
{
"Order": 2,
"Device": "CPU",
"Magnetostatic":
{
"Save": 2
1 change: 1 addition & 0 deletions examples/spheres/spheres.json
@@ -74,6 +74,7 @@
"Solver":
{
"Order": 3,
"Device": "CPU",
"Electrostatic":
{
"Save": 2