
Model can train from checkpoint but cannot continue training successively #81

Open · Samleo8 opened this issue Jun 3, 2020 · 2 comments


Samleo8 commented Jun 3, 2020

I have tried training the volumetric model on the CMU dataset, but I am running into further problems with training. The model successfully trains one epoch when started from the checkpoint of the previous epoch, but it cannot continue training beyond that first epoch.

The main error is `RuntimeError: NCCL communicator was aborted.`


In case this is useful, the full error stack trace is below:

  File "train.py", line 770, in <module>
    main(args)
  File "train.py", line 727, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 398, in one_epoch
    total_loss.backward()
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: NCCL communicator was aborted.
Traceback (most recent call last):
  File "/home/scleong/.pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, n _run_module_as_main
   "__main__", mod_spec)
karfly (Owner) commented Jun 3, 2020

It's some problem with multi-gpu training.

Samleo8 (Author) commented Jun 4, 2020

> It's some problem with multi-gpu training.

Oh no, is there a workaround?

Also, in case it helps, here is the NCCL debug info:

bigfoot:8514:8514 [0] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8514:8514 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8514:8514 [0] NCCL INFO NET/IB : No device found.
bigfoot:8514:8514 [0] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
NCCL version 2.4.8+cuda10.1
bigfoot:8514:8559 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
Successfully loaded pretrained weights for whole model
Optimising model...
Loading data...
bigfoot:8516:8516 [2] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8516 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8516:8516 [2] NCCL INFO NET/IB : No device found.
bigfoot:8516:8516 [2] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8516:8560 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
bigfoot:8515:8515 [1] NCCL INFO Bootstrap : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8515 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
bigfoot:8515:8515 [1] NCCL INFO NET/IB : No device found.
bigfoot:8515:8515 [1] NCCL INFO NET/Socket : Using [0]eno1:128.2.176.158<0>
bigfoot:8515:8561 [1] NCCL INFO Setting affinity for GPU 1 to 0fff
bigfoot:8514:8559 [0] NCCL INFO Channel 00 :    0   1   2
bigfoot:8515:8561 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
bigfoot:8516:8560 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
bigfoot:8514:8559 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
bigfoot:8515:8561 [1] NCCL INFO comm 0x7ff038001b40 rank 1 nranks 3 cudaDev 1 nvmlDev 1 - Init COMPLETE
bigfoot:8514:8559 [0] NCCL INFO comm 0x7fdf30001b40 rank 0 nranks 3 cudaDev 0 nvmlDev 0 - Init COMPLETE
bigfoot:8514:8514 [0] NCCL INFO Launch mode Parallel
bigfoot:8516:8560 [2] NCCL INFO comm 0x7f91ec001b40 rank 2 nranks 3 cudaDev 2 nvmlDev 2 - Init COMPLETE
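
In case a generic workaround helps in the meantime, here is a hedged sketch of NCCL-level knobs that usually make this kind of failure easier to diagnose, plus a single-GPU fallback. None of this is specific to this repo's train.py.

```python
# Generic NCCL-backend knobs, not specific to this repo; set them before the
# process group is created.
import os

# More verbose NCCL logging (NCCL_DEBUG=INFO is already enabled in the log above).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"

# With blocking wait, a collective that hangs raises an error once the timeout
# passed to init_process_group expires, instead of aborting more opaquely:
#   torch.distributed.init_process_group(
#       "nccl", timeout=datetime.timedelta(minutes=30), ...)
os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Last-resort workaround: restrict training to a single GPU (set before any
# CUDA/NCCL initialisation).
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```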
