
Model can train from checkpoint but cannot continue training successively #81

Open · Samleo8 opened this issue Jun 3, 2020 · 2 comments


Samleo8 commented Jun 3, 2020

I have tried training the volumetric model on the CMU dataset, but I am running into further problems with training. The model successfully trains one epoch when started from the checkpoint of the previous epoch, but it cannot continue training beyond that first epoch.

The main error is `RuntimeError: NCCL communicator was aborted.`


In case this is useful, the full error stack trace is below:

  File "train.py", line 770, in <module>
    main(args)
  File "train.py", line 727, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 398, in one_epoch
    total_loss.backward()
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL communicator was aborted.
Traceback (most recent call last):
  File "/home/scleong/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, n _run_module_as_main
   "__main__", mod_spec)
karfly (Owner) commented Jun 3, 2020

It's some problem with multi-gpu training.

Samleo8 (Author) commented Jun 4, 2020

> It's some problem with multi-gpu training.

Oh no, is there a workaround?

Also, in case it helps, here is the NCCL debug info:

bigfoot:8514:8514 [0] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8514:8514 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8514:8514 [0] NCCL INFO NET/IB : No device found.
bigfoot:8514:8514 [0] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
NCCL version 2.4.8+cuda10.1
bigfoot:8514:8559 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
bigfoot:8516:8516 [2] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8516 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8516:8516 [2] NCCL INFO NET/IB : No device found.
bigfoot:8516:8516 [2] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8560 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
bigfoot:8515:8515 [1] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8515 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8515:8515 [1] NCCL INFO NET/IB : No device found.
bigfoot:8515:8515 [1] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8561 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
bigfoot:8514:8559 [0] NCCL INFO Channel 00 :    0   1   2
bigfoot:8515:8561 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
bigfoot:8516:8560 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
bigfoot:8515:8561 [1] NCCL INFO comm 0x7ff038001b40 rank 1 nranks 3 cudaDev 1 nvmlDev 1 - Init COMPLETE
bigfoot:8514:8559 [0] NCCL INFO comm 0x7fdf30001b40 rank 0 nranks 3 cudaDev 0 nvmlDev 0 - Init COMPLETE
bigfoot:8514:8514 [0] NCCL INFO Launch mode Parallel
bigfoot:8516:8560 [2] NCCL INFO comm 0x7f91ec001b40 rank 2 nranks 3 cudaDev 2 nvmlDev 2 - Init COMPLETE
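
In case a generic workaround helps in the meantime, here is a hedged sketch of NCCL-level knobs that usually make this kind of failure easier to diagnose, plus a single-GPU fallback. None of this is specific to this repo's train.py.

```python
# Generic NCCL-backend knobs, not specific to this repo; set them before the
# process group is created.
import os

# More verbose NCCL logging (NCCL_DEBUG=INFO is already enabled in the log above).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"

# With blocking wait, a collective that hangs raises an error once the timeout
# passed to init_process_group expires, instead of aborting more opaquely:
#   torch.distributed.init_process_group(
#       "nccl", timeout=datetime.timedelta(minutes=30), ...)
os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Last-resort workaround: restrict training to a single GPU (set before any
# CUDA/NCCL initialisation).
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```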
