Model can train from checkpoint but cannot continue training successively #81
Comments
It's a problem with multi-GPU training.
Oh no, is there a workaround? Also, in case it helps, here is the NCCL debug info:
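For reference, NCCL debug output of this kind is typically enabled through environment variables set before the distributed process group is created. The following is a minimal sketch, not this repository's code: `nccl_debug_env` is a hypothetical helper name, and the subsystem list is an assumed default. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `TORCH_DISTRIBUTED_DEBUG` are the standard NCCL and PyTorch variables.

```python
import os


def nccl_debug_env(base=None, subsystems="INIT,COLL"):
    """Return a copy of an environment with NCCL debug logging enabled.

    Hypothetical helper for illustration; it is not part of this project.
    """
    # Copy so the caller's environment mapping is left untouched.
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL logs init and error details
    env["NCCL_DEBUG_SUBSYS"] = subsystems      # restrict logs to these subsystems
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed checks
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS", "TORCH_DISTRIBUTED_DEBUG"):
        print(f"{key}={env[key]}")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the training command, so the logging applies to every worker process.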
I have tried training the volumetric model on the CMU dataset, but I am running into more problems with training. When started from the previous epoch's checkpoint, the model trains one epoch successfully, but it cannot continue training once that first epoch has finished.
The main error is:
RuntimeError: NCCL communicator was aborted.
In case this is useful, the full error stack trace is below: