Add support for distributed training #34
Additionally, synchronize validation and test step logging when using multiple GPUs, as described in the PyTorch Lightning documentation.
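For reference, a minimal sketch of what that synchronized logging could look like using Lightning's `sync_dist` flag; the `DCRNNTask` module name and the MSE loss are placeholders, not part of the current torchts code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class DCRNNTask(pl.LightningModule):  # placeholder; not the actual torchts module
    def __init__(self, input_dim=2, output_dim=1):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)  # stand-in for the real model

    def forward(self, x):
        return self.linear(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # sync_dist=True reduces the logged metric across all GPUs/processes
        self.log("val_loss", F.mse_loss(self(x), y), sync_dist=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", F.mse_loss(self(x), y), sync_dist=True)
```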
Make sure to add the training and test step-end hooks according to the DDP caveats: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#dp-ddp2-caveats
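A minimal sketch of the step-end pattern those caveats describe: under DP/DDP2, `training_step` runs on a sub-batch per GPU, and the per-GPU outputs are reduced in `training_step_end`. The module name and loss are placeholders:

```python
import torch
import pytorch_lightning as pl

class DCRNNTask(pl.LightningModule):  # placeholder; not the actual torchts module
    def training_step(self, batch, batch_idx):
        x, y = batch
        # Under DP/DDP2 this receives only a sub-batch on each GPU
        return {"y_hat": self(x), "y": y}

    def training_step_end(self, outputs):
        # Outputs gathered from all GPUs arrive here; compute the loss once
        loss = torch.nn.functional.mse_loss(outputs["y_hat"], outputs["y"])
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        return {"y_hat": self(x), "y": y}

    def test_step_end(self, outputs):
        loss = torch.nn.functional.mse_loss(outputs["y_hat"], outputs["y"])
        self.log("test_loss", loss)
```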
We should add a flag to DCRNN for dense tensors and raise an error when sparse tensors are used with a distributed model. Using dense tensors on a single GPU is valid, but we should probably print a warning in that case since it is less memory-efficient.
@klane Could we automate these changes instead?
@akashshah59 Agreed, we do not need to add a new option. We can use sparse or dense tensors depending on the number of GPUs.
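As an illustration of that idea, a small sketch that picks the tensor implementation from the GPU count; the `sparse` keyword on `DCRNN` and the helper name are assumptions, not existing torchts API:

```python
import warnings

from torchts.nn.models.dcrnn import DCRNN

def build_dcrnn(adj_mx, scaler, gpus=0, **model_config):
    """Hypothetical helper: choose sparse vs. dense tensors from the GPU count."""
    use_sparse = gpus <= 1  # dense tensors only when training is distributed
    if not use_sparse:
        warnings.warn(
            "Using dense adjacency tensors for multi-GPU training; this avoids "
            "sparse-tensor errors under DDP but uses more memory."
        )
    # `sparse` is an assumed constructor argument on DCRNN
    return DCRNN(adj_mx, scaler, sparse=use_sparse, **model_config)
```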
Thanks, @klane, for confirming that! I'm thinking of creating an abstraction that wraps around the PyTorch Lightning `Trainer`. What I'm currently building can be found below.

Earlier implementation:

```python
from pytorch_lightning import Trainer
from torchts.nn.models.dcrnn import DCRNN

model = DCRNN(adj_mx, scaler, **model_config)
trainer = Trainer(max_epochs=10, logger=True, gpus=1)
trainer.fit(model, data['trainloader'], data['testloader'])
```

The newer implementation creates the model object internally, which lets it build a dense model or a sparse model as needed based on the `gpus` argument:

```python
from torchts.nn.models.train_wrapper import TimeSeriesTrainer

mytrainer = TimeSeriesTrainer(model='dcrnn', adj_mx=adj_mx, max_epochs=10, logger=True, gpus=1)
mytrainer.fit(data['trainloader'], data['testloader'])
```

This is cleaner in the sense that:
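For context, a rough sketch of what such a wrapper could look like internally; the class, its constructor arguments, and the `sparse` keyword on `DCRNN` are assumptions about code that does not exist yet:

```python
import pytorch_lightning as pl

from torchts.nn.models.dcrnn import DCRNN

class TimeSeriesTrainer:
    """Hypothetical wrapper that builds the model and the Lightning Trainer together."""

    def __init__(self, model, adj_mx, scaler=None, gpus=0, model_config=None, **trainer_kwargs):
        if model != "dcrnn":
            raise ValueError(f"Unsupported model: {model}")
        # Build a dense model for multi-GPU runs and a sparse one otherwise
        self.model = DCRNN(adj_mx, scaler, sparse=(gpus <= 1), **(model_config or {}))
        self.trainer = pl.Trainer(gpus=gpus, **trainer_kwargs)

    def fit(self, train_loader, val_loader=None):
        self.trainer.fit(self.model, train_loader, val_loader)
```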
Distributed training on multiple devices generates this error.
In a distributed environment, switching from the sparse tensor implementation to dense operations solves this problem. However, scalability must be taken into consideration, since dense implementations use more memory.
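A small sketch of that fallback, assuming the adjacency matrix is the sparse tensor in question; the helper name is illustrative:

```python
import torch

def prepare_adj(adj_mx: torch.Tensor, num_devices: int) -> torch.Tensor:
    """Illustrative helper: densify the adjacency matrix only when training is distributed."""
    if adj_mx.is_sparse and num_devices > 1:
        # Dense operations avoid the DDP sparse-tensor error at the cost of memory
        return adj_mx.to_dense()
    return adj_mx
```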