Avoid poisoning process with CUDA calls as soon as importing #6810
Switch from `torch.cuda.is_available()` to `torch.cuda.device_count() > 0`, to give priority to the NVML-based availability check, so that we avoid poisoning the process with CUDA calls as soon as we execute `import deepspeed` (see https://github.com/pytorch/pytorch/blob/v2.5.1/torch/cuda/__init__.py#L120-L124).
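A minimal sketch of the proposed check (the helper name is illustrative, not DeepSpeed's actual code):

```python
def cuda_seems_available():
    """Probe CUDA availability without poisoning the process.

    torch.cuda.device_count() can answer via an NVML-based probe (see
    the linked torch/cuda/__init__.py), which does not initialize the
    CUDA runtime in this process, unlike is_available()'s default
    cudaGetDeviceCount path.
    """
    try:
        import torch
    except ImportError:  # no torch at all -> no CUDA via torch
        return False
    return torch.cuda.device_count() > 0
```

Because no CUDA context is created by this probe, the caller can still change `CUDA_VISIBLE_DEVICES` or fork worker processes afterwards.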
There are two reasons to make this change:
First, if we accidentally import deepspeed: because the CUDA runtime is initialized on the first CUDA API call and caches the device list at that point, changing `CUDA_VISIBLE_DEVICES` within the same process after initialization has no effect on the visible devices. The specific case:
OpenRLHF/OpenRLHF#524 (comment)
A demo for reproduction before the fix is applied:
Second, from https://pytorch.org/docs/stable/notes/cuda.html:

> When assessing the availability of CUDA in a given environment (`is_available()`), PyTorch's default behavior is to call the CUDA Runtime API method `cudaGetDeviceCount`. Because this call in turn initializes the CUDA Driver API (via `cuInit`) if it is not already initialized, subsequent forks of a process that has run `is_available()` will fail with a CUDA initialization error.
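For code paths that still call `is_available()` and cannot be changed, the same PyTorch note documents an opt-in NVML-based availability check via an environment variable, which must be set before `torch.cuda` availability is first queried:

```shell
# Opt in to the fork-safe, NVML-based torch.cuda.is_available() check
# (set before the Python process first queries CUDA availability)
export PYTORCH_NVML_BASED_CUDA_CHECK=1
```

This sidesteps `cuInit` in the parent process, so subsequent forks can still initialize CUDA themselves.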