GPU support problematic inside singularity container #231

Closed
ocaisa opened this issue Feb 17, 2023 · 13 comments

ocaisa commented Feb 17, 2023

The approach for GPU support seems to work fine natively, but however singularity provides CUDA support, things seem to break down there. If you use the --nv flag to add CUDA support, you cannot actually see libcuda.so and friends inside the container. For us this is bad, as we need to see the actual libraries. This manifests as problems like:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

ocaisa commented Feb 17, 2023

The root of this is probably because singularity places the GPU libraries under /.singularity.d/libs inside the container.

Adding that location to LD_LIBRARY_PATH fixes the issue with nvidia-smi inside the container, but does not seem to fix it for things compiled with nvcc
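
Roughly, the manual fix that helped nvidia-smi (just a sketch of what was tried, using the bind-mount location mentioned above):

# make the Singularity-injected host libraries visible again
export LD_LIBRARY_PATH=/.singularity.d/libs:${LD_LIBRARY_PATH}
nvidia-smi    # works again, but binaries compiled with nvcc still fail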


ocaisa commented Feb 24, 2023

Singularity actually puts /.singularity.d/libs in the LD_LIBRARY_PATH when initialising, which is why it works.


ocaisa commented Feb 24, 2023

Dropping into a prefix shell clears all the environment variables, which is why things break.


ocaisa commented Feb 24, 2023

That the CUDA libs are not in a standard path is probably why things break. Since LD_LIBRARY_PATH is set, it will be the preferred location for finding libcuda.so (since it is not rpath-ed), but that is not what we want, as we need the libcuda.so from the compat libraries. Manually putting the compat libraries first in LD_LIBRARY_PATH does not seem to fix this though, so I suspect that not having the GPU drivers in a standard location is causing issues.


ocaisa commented May 9, 2023

So I have a "workaround" for this. We can update LD_LIBRARY_PATH when we are inside a container:

export LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat/:/.singularity.d/libs

This means it prefers the compatibility libraries we are trying to install, and it will at least get through the installation process.

It's not ideal, since we cannot check the native case where LD_LIBRARY_PATH is not set... and this setting would also need to be added to our initialisation scripts in general to ensure we always have priority in a container scenario.
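
For the initialisation scripts, a minimal sketch of what that could look like, assuming we only apply it when we detect a container via the environment variables that Singularity/Apptainer set:

# sketch: give the CUDA compat libraries priority when inside a container
if [ -n "${SINGULARITY_NAME:-}" ] || [ -n "${APPTAINER_NAME:-}" ]; then
  export LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat:/.singularity.d/libs
fi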


ocaisa commented May 9, 2023

The problem with this is that a user would need to have that variable set by default in their container, or running an executable from inside the container may not work.

There is a trick we could use, but I'm not sure it is a good idea... We could add /.singularity.d/libs to the default search paths of our linker (after our host_injections directory). We need to check if a similar approach is used by other container environments.
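
To make that trick concrete, a rough sketch of what it could look like, assuming the EESSI compat layer's linker honours its own $EPREFIX/etc/ld.so.conf and ships its own ldconfig under $EPREFIX/sbin (and glossing over the fact that the EESSI prefix is read-only, so in practice this would have to happen when the compat layer is built):

# sketch: extend the default search paths of the EESSI (Gentoo Prefix) linker,
# listing host_injections first so the compat libraries win
cat >> "$EPREFIX/etc/ld.so.conf" << 'EOF'
/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat
/.singularity.d/libs
EOF
"$EPREFIX/sbin/ldconfig"    # regenerate the compat layer's ld cache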


boegel commented May 9, 2023

Apptainer seems to have pretty good docs on their approach?
https://apptainer.org/docs/user/main/gpu.html#nvidia-gpus-cuda-standard


ocaisa commented May 9, 2023

Yes, they use --nv with LD_LIBRARY_PATH pointing to /.singularity.d/libs. The difficulty is that we want to override that so that we can "upgrade" the CUDA support natively available on the system.


ocaisa commented May 10, 2023

I'll summarise here a long discussion on Slack about the issue and how to tackle it (with liberal quoting from the discussion's contributors).

apptainer/singularity, in order to access libcuda.so and other libraries from the host, does the following:

  • run ldconfig -p
  • scan the output for any known nvidia libraries, e.g. in our case libcuda.so.1 (libc6,x86-64) => /usr/lib64/nvidia/libcuda.so.1
  • it will then, if you use the --nv option, bind mount those libraries into /.singularity.d/libs/ inside the container and automatically set LD_LIBRARY_PATH=/.singularity.d/libs

This way CUDA-using apps inside the container can use the libcuda.so etc. from the host.
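
A quick way to see that mechanism in action (just a sketch; eessi.sif is a placeholder image name):

# sketch: inspect what --nv injects into the container
singularity exec --nv eessi.sif ls /.singularity.d/libs
singularity exec --nv eessi.sif sh -c 'echo $LD_LIBRARY_PATH'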

Now, for EESSI, this is a bit of an issue. We want to use the CUDA compatibility libraries to "upgrade" the available CUDA version. Due to the use of LD_LIBRARY_PATH, the only option to do that is to prepend the location of the compatibility libraries to LD_LIBRARY_PATH to ensure they take precedence.

If LD_LIBRARY_PATH were not set, we could tweak our own linker to resolve this issue. If you look at how the linker resolves shared libraries, the precedence order is:

  1. RPATH
  2. LD_LIBRARY_PATH
  3. RUNPATH
  4. ld cache
  5. trusted libc directory

So, if we add /.singularity.d/libs and the location of our CUDA compat libraries (under host_injections) to the ld cache (with host_injections appearing first), the EESSI linker will resolve things correctly. Unfortunately, the default use of LD_LIBRARY_PATH throws a spanner in the works: if it is still set in the environment, the host libraries again take precedence. (Also note that without LD_LIBRARY_PATH, tools like nvidia-smi will stop working, since they use the system linker rather than the EESSI linker, but you can force them to use the EESSI linker with $EPREFIX/lib/ld-linux-x86-64.so.2 /path/to/nvidia-smi.)
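
You can check which candidate actually wins by asking the loader to trace its search (just a sketch, using the standard glibc LD_DEBUG facility and the EESSI linker invocation mentioned above):

# sketch: trace where libcuda.so.1 gets resolved from
LD_DEBUG=libs nvidia-smi 2>&1 | grep -i libcuda
# and compare with the EESSI linker forced explicitly
LD_DEBUG=libs "$EPREFIX"/lib/ld-linux-x86-64.so.2 /path/to/nvidia-smi 2>&1 | grep -i libcuda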

This gets messier when we think of what might happen inside a container shell. If a user chooses to start a prefix shell inside the container, LD_LIBRARY_PATH gets cleaned out and the GPU stops working unless we take the ld cache approach.

Now that we've outlined the problems, what is the solution? Let's say we just take the ld cache approach and ignore the use of LD_LIBRARY_PATH. Well, the reality is that if the CUDA driver on the host is recent enough, things will "just work". If it is not recent enough, then we enter the territory where the end user will need to integrate the CUDA compatibility libraries into the container themselves. At that point we can document what they need to do (see the sketch after this list); they would need to:

  • bind mount the host_injections directory,
  • enter a shell in the container and run our GPU support script,
  • tweak the default value of LD_LIBRARY_PATH of the container so that the compat libraries are picked up first.
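
Put together, those documented steps could look roughly like this (purely a sketch: the bind-mount source, the image name and the name/location of the GPU support script are placeholders):

# 1. bind mount a writable host_injections directory and enter the container
singularity shell --nv \
  --bind /path/on/host:/cvmfs/pilot.eessi-hpc.org/host_injections \
  eessi.sif

# 2. inside the container, run our GPU support script (placeholder name)
./gpu_support/add_nvidia_gpu_support.sh

# 3. make sure the compat libraries are picked up before the --nv libraries
export LD_LIBRARY_PATH=/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat:/.singularity.d/libs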

Given that we will have to document what people need to do in a container environment if they need to update the compat libraries anyway, I'm of the mind that we should just use the ld cache approach. It'll always work for a native environment, and work the majority of the time out-of-the-box in a container. For the container cases where it doesn't work (because the host CUDA driver is too old), we document how to get around it. Change my mind!


boegel commented May 10, 2023

@ocaisa Thanks for documenting this nicely here.

I agree with your view: if it works out of the box for the most common cases (native EESSI, or CUDA driver is recent enough), then it's OK. I consider the container use case already a special one, especially in the long term.

We should make the documentation that explains how to fix problems when they arise easy to find, by including the error messages that may pop up.


terjekv commented May 10, 2023

Okay, first of all, amazing breakdown. This makes the complexities involved about as clear as they can be. I would lean towards the ld cache approach as well. It is much, much cleaner than trying to solve every scenario involving containers and drivers.

I do worry about how we document the solution for when problems arise. Anything detailed enough to be useful will be very long and full of caveats, and that may cause people to just close the page and give up. But it shouldn't be needed most of the time.


trz42 commented May 10, 2023

Nice writeup @ocaisa. Could we develop (ReFrame) tests for all possible cases and either run them (regularly) ourselves or ask users to run them to verify that their environment is well configured?

@boegel boegel modified the milestone: 2023Q1_GPU May 17, 2023

ocaisa commented Dec 21, 2023

GPU support was implemented with #434.

The linking situation is resolved for now, but we will need more trusted directories in future iterations of EESSI (to make it easier to update drivers for different accelerators).

@ocaisa ocaisa closed this as completed Dec 21, 2023