show expected and problematic output produced by deviceQuery in GPU docs #139
base: main
...
```

If the `deviceQuery` command can not access your GPU, you will see an error message like:
This shouldn't actually happen, though. Because of the Lmod guards, the only scenario I can see where you would reach this is when you are using a container and the system drivers are too old.
I triggered it by cleaning out the `host_injections` directory after loading the module.
I agree it's very unlikely that it happens, but we should mention it in the docs regardless, if only to let people easily find this page when searching for error messages.
My concern is that the placement makes it seem like a failure is likely, when reaching this message is actually very unlikely.
Maybe a little box saying "What does it look like if the command fails?"
@@ -152,10 +152,32 @@ The only scenario where this would be required is if `$LD_LIBRARY_PATH` is modif

### Testing the GPU support {: #gpu_cuda_testing }
Currently, this only covers testing whether you can run CUDA-enabled software from EESSI. Maybe we can also include a short instruction for testing whether building new CUDA software on top of EESSI works properly. Something like this:
First, create a file `hello_cuda.cu` with the contents
```
#include <stdio.h>

// Kernel: runs on the GPU and prints a greeting.
__global__ void helloCUDA()
{
    printf("Hello, CUDA!\n");
}

int main()
{
    // Launch the kernel with one block of one thread.
    helloCUDA<<<1, 1>>>();
    // Wait for the kernel to finish so its output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```
Then
```
module load CUDA/<some_version>
nvcc hello_cuda.cu -o hello_cuda
chmod u+x hello_cuda
./hello_cuda
```
And mention they should test this for each version of CUDA they installed in host_injections
Makes sense, but that should be done in a separate PR?
If you want, sure. I won't block this one over it :) Although I would consider it to be an integral part of "Testing the GPU support" to be honest :)
I don't see it as so integral if we are focused on software consumers. It's only integral if you want to do development-type work.
If the `deviceQuery` command can not access your GPU, you will see an error message like:
```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
```
- If the `deviceQuery` command can not access your GPU, you will see an error message like:
- ```
- cudaGetDeviceCount returned 35
- -> CUDA driver version is insufficient for CUDA runtime version
- Result = FAIL
- ```
- ```
+ !!! note "What if the `deviceQuery` command fails?"
+     If the `deviceQuery` command cannot access your GPU, you will see an error message like:
+     ```
+     cudaGetDeviceCount returned 35
+     -> CUDA driver version is insufficient for CUDA runtime version
+     Result = FAIL
+     ```
showing output in case it doesn't work is useful for searching purposes...
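Since the failure output quoted in this thread is meant to be searchable, it could also be detected from a script. The following is only a minimal sketch: `check_device_query` is a hypothetical helper name, its messages and exit codes are made up for illustration, and it assumes a successful `deviceQuery` run ends with `Result = PASS` (the success output is not quoted in this thread).

```shell
# Hypothetical helper (not part of EESSI): classify deviceQuery output read from stdin.
# Exit codes are an illustrative choice: 0 = pass, 1 = driver too old, 2 = anything else.
check_device_query() {
    dq_output=$(cat)
    case "$dq_output" in
        *"Result = PASS"*)
            # Assumption: deviceQuery prints this line when it can access the GPU.
            echo "deviceQuery passed: GPU is accessible"
            return 0
            ;;
        *"CUDA driver version is insufficient"*)
            # Matches the failure message quoted in this PR.
            echo "deviceQuery failed: CUDA driver is too old for the CUDA runtime"
            return 1
            ;;
        *)
            echo "deviceQuery failed: unrecognized output"
            return 2
            ;;
    esac
}

# Example use, assuming deviceQuery is on your PATH:
#   deviceQuery | check_device_query
```

Matching on the message text rather than on the numeric code `35` keeps the check readable, at the cost of depending on the exact wording of the error string.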