[c10d] Add stream info during nccl comm abort call (pytorch#116076)
Pull Request resolved: pytorch#116076
Approved by: https://github.com/XilunWu
fduwjj authored and pytorchmergebot committed Dec 29, 2023
1 parent e8a9d08 commit afadfa0
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
@@ -1109,8 +1109,17 @@ void ProcessGroupNCCL::abortCommsFromMap(
     // their responsibility to destroy the process group and recreate
     // it to recover from errors.

+    c10::StreamId streamId = -1;
+    if (ncclStreams_.find(devName) != ncclStreams_.end()) {
+      auto streams = ncclStreams_.at(devName);
+      if (streams.size() > 0) {
+        streamId = streams[0].id();
+      }
+    }
+
     LOG(INFO) << logPrefix() << "] Destroyed " << ncclComms.size()
-              << "communicators on CUDA device " << devName;
+              << "communicators on CUDA device: " << devName
+              << " with stream: " << streamId;
   }
 }
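
Note on the change: the added lines look up the first NCCL stream cached for `devName` and fall back to -1 when the device has no entry or an empty stream list. The standalone sketch below illustrates that lookup pattern outside PyTorch. `FakeStream`, `StreamMap`, `firstStreamIdOrDefault`, and `describeAbort` are hypothetical names made up for this illustration and are not part of c10d. The sketch makes two small variations on the diff, offered as suggestions only: it uses a single `find` and a reference instead of `find` plus `at` plus a vector copy, and it puts a space before "communicators" (as written in the diff, the count and the word are streamed back to back, for example "Destroyed 2communicators").

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the real c10 types.
using StreamId = std::int64_t;

struct FakeStream {
  StreamId value;
  StreamId id() const { return value; }
};

using StreamMap = std::unordered_map<std::string, std::vector<FakeStream>>;

// Returns the id of the first stream cached for devName, or -1 when the
// device has no entry or its stream list is empty.
StreamId firstStreamIdOrDefault(const StreamMap& streams,
                                const std::string& devName) {
  auto it = streams.find(devName);  // one lookup, no copy of the vector
  if (it != streams.end() && !it->second.empty()) {
    return it->second.front().id();
  }
  return -1;
}

// Builds a message shaped like the log line in the diff, with a space
// inserted between the communicator count and the word "communicators".
std::string describeAbort(std::size_t numComms,
                          const std::string& devName,
                          StreamId streamId) {
  std::ostringstream out;
  out << "Destroyed " << numComms << " communicators on CUDA device: "
      << devName << " with stream: " << streamId;
  return out.str();
}
```

Called with a map that holds streams for device "0" only, `firstStreamIdOrDefault` returns that device's first stream id, and returns -1 for any other device name, which matches the default the diff logs when no stream is found.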
