update GPU docs about cuDNN and new script to install full SDKs #328

Changes to `docs/site_specific_config/gpu.md` (38 additions, 25 deletions):

For CUDA-enabled software to run, it needs to be able to find the **NVIDIA GPU drivers**.
The challenge here is that the NVIDIA GPU drivers are not _always_ in a standard system location, and that we
can not install the GPU drivers in EESSI (since they are too closely tied to the client OS and GPU hardware).

### Compiling software on top of CUDA, cuDNN and other SDKs provided by NVIDIA {: #cuda_sdk }

An additional requirement applies if you want to be able to compile software
that makes use of a CUDA installation or cu\* SDKs (e.g., cuDNN) included in
EESSI. This requires a *full* installation of the CUDA SDK, cuDNN, etc. However,
the [CUDA SDK End User License Agreement (EULA)](https://docs.nvidia.com/cuda/eula/index.html)
and the [Software License Agreement (SLA) for NVIDIA cuDNN](https://docs.nvidia.com/deeplearning/cudnn/latest/reference/eula.html)
do not allow for full redistribution. In EESSI, we are (currently) only allowed to
redistribute the files needed to *run* CUDA and cuDNN software.

!!! note "Full CUDA SDK or cuDNN SDK only needed to *compile* CUDA or cuDNN software"
    Without a full CUDA SDK or cuDNN SDK on the host system, you will still
    be able to *run* CUDA-enabled or cuDNN-enabled software from the EESSI stack;
    you just won't be able to *compile* additional CUDA or cuDNN software.

Below, we describe how to make sure that the EESSI software stack can find your
NVIDIA GPU drivers and (optionally) full installations of the CUDA SDK and the
cuDNN SDK.

### Configuring CUDA driver location {: #driver_location }

All CUDA-enabled software in EESSI expects the CUDA drivers to be available in a specific subdirectory of this `host_injections` directory.
In addition, installations of the CUDA SDK and cuDNN SDK included in EESSI are stripped down to the files that we are allowed to redistribute;
all other files are replaced by symbolic links that point to another specific subdirectory of `host_injections`. For example:
```
$ ls -l /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc
lrwxrwxrwx 1 cvmfs cvmfs 109 Dec 21 14:49 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc -> /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc
```
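To check whether the full SDK has already been provided on a given system, you can test whether such a symlink actually resolves. A minimal sketch, reusing the `nvcc` example path from above (adjust the CPU target and CUDA version to your setup):

```{ .bash .copy }
# If the symlink's target exists under host_injections, the full CUDA SDK
# has been installed; '-e' follows symlinks, so it fails on dangling links.
nvcc_link="/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc"
if [ -e "$nvcc_link" ]; then
    echo "full CUDA SDK available under host_injections"
else
    echo "dangling symlink: full CUDA SDK not installed yet"
fi
```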

If the corresponding full installation of the CUDA SDK is available there, the
CUDA installation included in EESSI can be used to build CUDA software. The same
applies to the cuDNN SDK.


### Using NVIDIA GPUs via a native EESSI installation {: #nvidia_eessi_native }
This script uses `ldconfig` on your host system to locate your GPU drivers, and creates symbolic links to them under the `host_injections` directory.

Note that it is safe to re-run the script even if no driver updates were done: the script should detect that the current version of the drivers were already symlinked.
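Since the script relies on `ldconfig`, one way to sanity-check your host beforehand is to ask the dynamic linker cache whether it already knows about the NVIDIA driver libraries. The library names below are the usual driver-provided ones, shown for illustration, not an exact list of what the script matches:

```{ .bash .copy }
# Ask the dynamic linker cache for NVIDIA driver libraries; no output
# means ldconfig cannot see the drivers, and the link script won't either.
ldconfig -p | grep -i -E 'libcuda\.so|libnvidia-ml\.so'
```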

#### Installing full CUDA SDK and cuDNN SDK (optional)

To install a full CUDA SDK and cuDNN SDK under `host_injections`, use the `install_cuda_and_libraries.sh` script that is included in EESSI:

```{ .bash .copy }
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
```

For example, to install CUDA 12.1.1 and cuDNN 8.9.2.26 in the directory that the [`host_injections` variant symlink](host_injections.md) points to,
using `/tmp/$USER/EESSI` as directory to store temporary files:
```
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh --temp-dir /tmp/$USER/EESSI --accept-cuda-eula --accept-cudnn-eula
```
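For completeness, a hypothetical end-to-end sequence might look as follows; the init path and scratch directory are illustrative, so adjust them to your system:

```{ .bash .copy }
# Initialize EESSI (sets EESSI_VERSION), create a scratch directory,
# then run the installer, accepting both the CUDA and cuDNN EULAs.
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
mkdir -p "/tmp/$USER/EESSI"
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh \
    --temp-dir "/tmp/$USER/EESSI" --accept-cuda-eula --accept-cudnn-eula
```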
The versions 12.1.1 for CUDA and 8.9.2.26 for cuDNN are defined in an easystack
file that is also included in EESSI:
```
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/easystacks/eessi-2023.06-eb-4.9.4-2023a-CUDA-host-injections.yml
```
By default, the install script processes all files matching `eessi-*CUDA*.yml` in
the easystacks directory `/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/easystacks`.
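To see which easystack files the script would process with default options, you can expand the same glob yourself (assuming `EESSI_VERSION` is set, e.g. after initializing EESSI):

```{ .bash .copy }
# List the easystack files matching the default eessi-*CUDA*.yml pattern
ls /cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/easystacks/eessi-*CUDA*.yml
```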

You can run `/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh --help` to check all of the options.

!!! tip

    This script uses EasyBuild to install the CUDA SDK and the cuDNN SDK. For this to work, two requirements need to be satisfied:

    * `module load EasyBuild/${EB_VERSION}` must work (`EB_VERSION` is extracted
      from the name of the easystack file; e.g., from `eb-4.9.4`, `EB_VERSION` is
      derived as `4.9.4`);
    * `module load EESSI-extend/${EESSI_VERSION}-easybuild` must work.

    Both modules are included in EESSI.


### Using NVIDIA GPUs via EESSI in a container {: #nvidia_eessi_container }