NVIDIA driver #169

Open
RichardScottOZ opened this issue Aug 14, 2022 · 4 comments

RichardScottOZ (Contributor) commented Aug 14, 2022

This is an odd one, unless it is the type

detect_worker_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 384, in current_device
detect_worker_1  |     _lazy_init()
detect_worker_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 186, in _lazy_init
detect_worker_1  |     _check_driver()
detect_worker_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 65, in _check_driver
detect_worker_1  |     raise AssertionError("""
detect_worker_1  | AssertionError:
detect_worker_1  | Found no NVIDIA driver on your system. Please check that you
detect_worker_1  | have an NVIDIA GPU and installed a driver from
detect_worker_1  | http://www.nvidia.com/Download/index.aspx
detect_worker_1  | distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=15 parent=1 started daemon>
cosmos_detect_worker_1 exited with code 1
qnvida^CGracefully stopping... (press Ctrl+C again to force)
Stopping cosmos_worker_1    ...
Stopping cosmos_runner_1    ...
Stopping cosmos_scheduler_1 ...
Killing cosmos_worker_1    ... done
Killing cosmos_runner_1    ... done
Killing cosmos_scheduler_1 ... done
(tensorflow2_p38) ubuntu@ip-172-31-15-73:~/data/Cosmos$ ^C
(tensorflow2_p38) ubuntu@ip-172-31-15-73:~/data/Cosmos$ nvidia-smi
Sun Aug 14 03:07:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P8    16W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I've used these for many things - I'll have to try an older version sometime.

iross (Contributor) commented Aug 15, 2022

Hi Richard--
Looks like nvidia-smi works fine on the host machine, but have you run it within a docker container? The issue may be in the nvidia-docker interface.
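
A quick way to test that layer (just a sketch, assuming the NVIDIA Container Toolkit is installed; the CUDA image tag below is only an example, any CUDA base image would do) is to run nvidia-smi inside a throwaway container:

    # Check GPU visibility through the container runtime, independent of the Cosmos compose setup
    docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

    # Equivalent check on hosts still using the legacy nvidia-docker2 runtime
    docker run --rm --runtime=nvidia nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If that also reports "Found no NVIDIA driver", the problem is in the nvidia-docker/toolkit layer rather than in the compose file.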

RichardScottOZ (Contributor, Author) commented

Hi Ian,

Yeah, I was running it in docker, so that was my thought too, but thought I would ask.

If I'm going to use it on a lot of things rather than just some tests - would it be better to get it running natively?

RichardScottOZ (Contributor, Author) commented

Or will it be fine if I can work out what the compose config tweaks [and additional setup necessary] might be?

e.g. whatever additional nvidia container wizardry might need to go on a generic machine setup?
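
For a generic Ubuntu box, the extra piece would presumably be the NVIDIA Container Toolkit - a rough sketch, assuming NVIDIA's apt repository for it is already configured per their install docs:

    # Install the runtime hook that exposes the host driver to containers,
    # then restart Docker so it picks it up
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker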

RichardScottOZ (Contributor, Author) commented

There are some suggestions of things like this:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

How that would tie into your not-so-straightforward setup there, I'm not sure.
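
For what it's worth, a minimal sketch of how that block might hang off a GPU worker service in a Compose v3 file (the service and image names here are placeholders, not the actual Cosmos layout):

    services:
      detect_worker:                    # placeholder service name
        image: cosmos-detect:latest     # placeholder image
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]

Older Compose file versions (2.3/2.4) put runtime: nvidia on the service instead of the deploy.resources block, so it depends on which schema the existing files use.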

RichardScottOZ changed the title from "NVIDA driver" to "NVIDIA driver" on Nov 3, 2023