GPU support problematic inside singularity container #231

The approach for GPU support seems to work fine natively, but however Singularity supports CUDA, things seem to break down there. If you use the `--nv` flag to add CUDA support, you cannot actually see `libcuda.so` and friends inside the container. For us this is bad, as we need to see the actual libraries. This manifests as problems like: …
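A quick way to see the symptom (a sketch; the image name `eessi.sif` is a placeholder, and the exact bind location can vary between Singularity/Apptainer versions):

```sh
# With --nv, Singularity bind-mounts the host NVIDIA libraries (libcuda.so and
# friends) into the container, typically under /.singularity.d/libs, and exposes
# them via LD_LIBRARY_PATH rather than via a standard library path.
singularity exec --nv eessi.sif ls /.singularity.d/libs/

# Because that directory is not a trusted linker path inside the container,
# libcuda.so does not show up in the container's ld cache:
singularity exec --nv eessi.sif sh -c 'ldconfig -p | grep libcuda || echo "libcuda not in the ld cache"'
```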
The root of this is probably because … Adding that location to …
Singularity actually puts the host's `libcuda.so` and friends under `/.singularity.d/libs`, not in a standard library path.
Dropping into a prefix shell clears all the envvars, which is why things break.

That the CUDA libs are not in a standard path is probably why things break. Since …
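To make those two points concrete, a hedged illustration (the compat layer path below is an example and may not match the actual EESSI layout):

```sh
# LD_LIBRARY_PATH (including /.singularity.d/libs) is set in a normal container shell...
singularity exec --nv eessi.sif sh -c 'echo "$LD_LIBRARY_PATH"'

# ...but a Gentoo Prefix shell starts from a cleaned environment, so that entry is
# gone and the host libcuda.so can no longer be found (example compat layer path):
singularity exec --nv eessi.sif \
    /cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/startprefix
```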
So I have a "workaround" for this. We can update
This means it prefers the compatibility libraries we are trying to install, and it will at least get through the installation process. It's not ideal, since we cannot check the native case where |
The problem with this is that a user would need to have that variable set by default in their container or running an executable from inside the container may not work. There is a trick we could use but I'm not sure it is a good idea...We could add |
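A minimal sketch of that kind of workaround, assuming the compatibility libraries end up in a `host_injections`-style directory (the path below is purely illustrative):

```sh
# Illustrative location of the installed CUDA compatibility libraries.
CUDA_COMPAT_DIR="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/compat"

# Prepend it so the dynamic linker prefers the compat libcuda.so over the one
# Singularity injects under /.singularity.d/libs.
export LD_LIBRARY_PATH="${CUDA_COMPAT_DIR}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```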
Apptainer seems to have pretty good docs on their approach?

Yes, they use …
I'll summarise a long discussion on Slack about the issue here and how to tackle it (with liberal quoting from discussion contributors).
With `--nv`, Singularity injects the host's NVIDIA libraries into the container and puts them on `LD_LIBRARY_PATH`. This way, CUDA-using apps inside the container can use the host's `libcuda.so`.

Now, for EESSI, this is a bit of an issue. We want to use the CUDA compatibility libraries to "upgrade" the available CUDA version. Due to the use of `LD_LIBRARY_PATH` by `--nv`, … If …
So, if we add … This gets messier when we think of what might happen inside a container shell. If a user chooses to start a prefix shell inside the container, … (a prefix shell clears the environment, so anything delivered via `LD_LIBRARY_PATH` is gone).

Now that we've outlined the problems, what is the solution? So let's say we just take the ld cache approach and ignore the use of `LD_LIBRARY_PATH`. …
Given that we will have to document what people need to do in a container environment if they need to update the compat libraries anyway, I'm of the mind that we should just use the ld cache approach. It'll always work for a native environment, and work the majority of the time out-of-the-box in a container. For the container cases where it doesn't work (because the host CUDA driver is too old), we document how to get around it. Change my mind!
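For concreteness, a generic sketch of the ld cache approach on a plain Linux system (in EESSI the equivalent configuration and `ldconfig` would live in the compatibility layer, and `/opt/cuda-compat` is just a placeholder):

```sh
# Register the directory holding the CUDA compat libraries as a trusted
# location for the dynamic linker, then rebuild the linker cache.
echo "/opt/cuda-compat" | sudo tee /etc/ld.so.conf.d/cuda-compat.conf
sudo ldconfig

# libcuda.so is now found via the cache, with no LD_LIBRARY_PATH involved,
# so it survives environment-clearing steps like starting a prefix shell.
ldconfig -p | grep libcuda
```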
@ocaisa Thanks for documenting this nicely here. I agree with your view: if it works out of the box for the most common cases (native EESSI, or the CUDA driver is recent enough), then it's OK. I consider the container use case already a special one, especially in the long term. We should make the documentation that explains how to fix problems, if they arise, easy to find, including the error messages that may pop up.

Okay, first of all, amazing breakdown. This makes it as clear as possible, probably, to see the complexities involved. I would lean towards the ld cache approach as well. It is much, much cleaner than trying to solve every scenario with containers and drivers. I do fear how we document the solution if problems arise. Anything detailed enough to be useful will be very long and full of clauses, and that may cause people to just close the page and give up. But it shouldn't be needed most of the time.

Nice writeup @ocaisa. Could we develop (ReFrame) tests for all possible cases and run them (regularly), OR tell users to please run them to verify that their environment is well-configured?
GPU support implemented with #434. The linking situation is resolved for now, but we need more trusted directories in future iterations of EESSI (to make it easier to update drivers for different accelerators).