Add support for distributed training #34
Additionally, synchronize validation and test step logging when using multiple GPUs, as described in the PyTorch Lightning documentation.
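For reference, a minimal sketch of what that synchronized logging could look like using Lightning's `sync_dist` flag; the `DCRNNTask` module name and the MSE loss are placeholders, not part of the current torchts code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class DCRNNTask(pl.LightningModule):  # placeholder; not the actual torchts module
    def __init__(self, input_dim=2, output_dim=1):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)  # stand-in for the real model

    def forward(self, x):
        return self.linear(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # sync_dist=True reduces the logged metric across all GPUs/processes
        self.log("val_loss", F.mse_loss(self(x), y), sync_dist=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", F.mse_loss(self(x), y), sync_dist=True)
```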
Make sure to add the training and test step-end hooks according to the DDP caveats: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#dp-ddp2-caveats
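A minimal sketch of the step-end pattern those caveats describe: under DP/DDP2, `training_step` runs on a sub-batch per GPU, and the per-GPU outputs are reduced in `training_step_end`. The module name and loss are placeholders:

```python
import torch
import pytorch_lightning as pl

class DCRNNTask(pl.LightningModule):  # placeholder; not the actual torchts module
    def training_step(self, batch, batch_idx):
        x, y = batch
        # Under DP/DDP2 this receives only a sub-batch on each GPU
        return {"y_hat": self(x), "y": y}

    def training_step_end(self, outputs):
        # Outputs gathered from all GPUs arrive here; compute the loss once
        loss = torch.nn.functional.mse_loss(outputs["y_hat"], outputs["y"])
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        return {"y_hat": self(x), "y": y}

    def test_step_end(self, outputs):
        loss = torch.nn.functional.mse_loss(outputs["y_hat"], outputs["y"])
        self.log("test_loss", loss)
```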
We should add a flag to DCRNN for dense tensors and raise an error when sparse tensors are used with a distributed model. Using dense tensors on a single GPU is valid, but we should probably print a warning in that case since it is less memory-efficient.
@klane Could we automate these changes instead?
@akashshah59 Agreed, we do not need to add a new option. We can use sparse or dense tensors depending on the number of GPUs.
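As an illustration of that idea, a small sketch that picks the tensor implementation from the GPU count; the `sparse` keyword on `DCRNN` and the helper name are assumptions, not existing torchts API:

```python
import warnings

from torchts.nn.models.dcrnn import DCRNN

def build_dcrnn(adj_mx, scaler, gpus=0, **model_config):
    """Hypothetical helper: choose sparse vs. dense tensors from the GPU count."""
    use_sparse = gpus <= 1  # dense tensors only when training is distributed
    if not use_sparse:
        warnings.warn(
            "Using dense adjacency tensors for multi-GPU training; this avoids "
            "sparse-tensor errors under DDP but uses more memory."
        )
    # `sparse` is an assumed constructor argument on DCRNN
    return DCRNN(adj_mx, scaler, sparse=use_sparse, **model_config)
```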
Thanks, @klane, for confirming that! I'm thinking of creating an abstraction that wraps around the PyTorch Lightning `Trainer`. What I'm currently building can be found below.

Earlier implementation:

```python
from pytorch_lightning import Trainer
from torchts.nn.models.dcrnn import DCRNN

model = DCRNN(adj_mx, scaler, **model_config)
trainer = Trainer(max_epochs=10, logger=True, gpus=1)
trainer.fit(model, data['trainloader'], data['testloader'])
```

The newer implementation creates the model object internally, which lets it build a dense model or a sparse model as needed based on the `gpus` argument:

```python
from torchts.nn.models.train_wrapper import TimeSeriesTrainer

mytrainer = TimeSeriesTrainer(model='dcrnn', adj_mx=adj_mx, max_epochs=10, logger=True, gpus=1)
mytrainer.fit(data['trainloader'], data['testloader'])
```

This is cleaner in the sense that:
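For context, a rough sketch of what such a wrapper could look like internally; the class, its constructor arguments, and the `sparse` keyword on `DCRNN` are assumptions about code that does not exist yet:

```python
import pytorch_lightning as pl

from torchts.nn.models.dcrnn import DCRNN

class TimeSeriesTrainer:
    """Hypothetical wrapper that builds the model and the Lightning Trainer together."""

    def __init__(self, model, adj_mx, scaler=None, gpus=0, model_config=None, **trainer_kwargs):
        if model != "dcrnn":
            raise ValueError(f"Unsupported model: {model}")
        # Build a dense model for multi-GPU runs and a sparse one otherwise
        self.model = DCRNN(adj_mx, scaler, sparse=(gpus <= 1), **(model_config or {}))
        self.trainer = pl.Trainer(gpus=gpus, **trainer_kwargs)

    def fit(self, train_loader, val_loader=None):
        self.trainer.fit(self.model, train_loader, val_loader)
```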
Distributed training on multiple devices generates this error.
In a distributed environment, switching from the sparse tensor implementation to dense operations solves this problem. However, scalability must be taken into consideration, since dense implementations use more memory.
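A small sketch of that fallback, assuming the adjacency matrix is the sparse tensor in question; the helper name is illustrative:

```python
import torch

def prepare_adj(adj_mx: torch.Tensor, num_devices: int) -> torch.Tensor:
    """Illustrative helper: densify the adjacency matrix only when training is distributed."""
    if adj_mx.is_sparse and num_devices > 1:
        # Dense operations avoid the DDP sparse-tensor error at the cost of memory
        return adj_mx.to_dense()
    return adj_mx
```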