Multi-GPU fails when using ConcatDataLoader #3047

Open
absudabsu opened this issue Nov 18, 2024 · 1 comment · May be fixed by #3053
@absudabsu
When using multi-GPU training with DDP, e.g. with a SemiSupervisedDataLoader (and therefore ConcatDataLoader), an error is raised because the distributed_sampler keyword argument required by AnnDataLoader is not an explicit parameter in the ConcatDataLoader class definition. Passing it in through the **kwargs wildcard allows AnnDataLoader to be initialized properly, but the same distributed_sampler keyword is then also forwarded to the parent class torch.utils.data.DataLoader, whose __init__ rejects it.

datasplitter_kwargs = {}
datasplitter_kwargs['distributed_sampler'] = True
data_splitter = SemiSupervisedDataSplitter(
    adata_manager=self.adata_manager,
    train_size=train_size,
    validation_size=validation_size,
    shuffle_set_split=shuffle_set_split,
    n_samples_per_label=n_samples_per_label,
    batch_size=batch_size,
    **datasplitter_kwargs,
)
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/scvi/train/_trainrunner.py", line 98, in __call__
2024-11-18T23:06:15.416Z [rank1]:     self.trainer.fit(self.training_plan, self.data_splitter)
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/scvi/train/_trainer.py", line 219, in fit
2024-11-18T23:06:15.416Z [rank1]:     super().fit(*args, **kwargs)
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
2024-11-18T23:06:15.416Z [rank1]:     call._call_and_handle_interrupt(
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
2024-11-18T23:06:15.416Z [rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
2024-11-18T23:06:15.416Z [rank1]:     return function(*args, **kwargs)
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
2024-11-18T23:06:15.416Z [rank1]:     self._run(model, ckpt_path=ckpt_path)
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
2024-11-18T23:06:15.416Z [rank1]:     results = self._run_stage()
2024-11-18T23:06:15.416Z [rank1]:               ^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
2024-11-18T23:06:15.416Z [rank1]:     self.fit_loop.run()
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 194, in run
2024-11-18T23:06:15.416Z [rank1]:     self.setup_data()
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 222, in setup_data
2024-11-18T23:06:15.416Z [rank1]:     train_dataloader = _request_dataloader(source)
2024-11-18T23:06:15.416Z [rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 342, in _request_dataloader
2024-11-18T23:06:15.416Z [rank1]:     return data_source.dataloader()
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 309, in dataloader
2024-11-18T23:06:15.416Z [rank1]:     return call._call_lightning_datamodule_hook(self.instance.trainer, self.name)
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 179, in _call_lightning_datamodule_hook
2024-11-18T23:06:15.416Z [rank1]:     return fn(*args, **kwargs)
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/scvi/dataloaders/_data_splitting.py", line 332, in train_dataloader
2024-11-18T23:06:15.416Z [rank1]:     return self.data_loader_class(
2024-11-18T23:06:15.416Z [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/fabric/utilities/data.py", line 324, in wrapper
2024-11-18T23:06:15.416Z [rank1]:     init(obj, *args, **kwargs)
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/scvi/dataloaders/_semi_dataloader.py", line 75, in __init__
2024-11-18T23:06:15.416Z [rank1]:     super().__init__(
2024-11-18T23:06:15.416Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/fabric/utilities/data.py", line 324, in wrapper
2024-11-18T23:06:15.417Z [rank1]:     init(obj, *args, **kwargs)
2024-11-18T23:06:15.417Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/scvi/dataloaders/_concat_dataloader.py", line 65, in __init__
2024-11-18T23:06:15.417Z [rank1]:     super().__init__(self.largest_dl, **data_loader_kwargs)
2024-11-18T23:06:15.417Z [rank1]:   File "/stage/env/lib/python3.11/site-packages/lightning/fabric/utilities/data.py", line 324, in wrapper
2024-11-18T23:06:15.417Z [rank1]:     init(obj, *args, **kwargs)
2024-11-18T23:06:15.417Z [rank1]: TypeError: DataLoader.__init__() got an unexpected keyword argument 'distributed_sampler'
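The failure mode (and one possible shape of a fix) can be sketched without scvi-tools at all. Below, `Base` stands in for `torch.utils.data.DataLoader` and the two child classes stand in for `ConcatDataLoader` before and after naming the keyword explicitly; all class names here are illustrative, not the actual scvi-tools API.

```python
class Base:
    """Stand-in for torch.utils.data.DataLoader: accepts only known kwargs."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size


class ChildBroken(Base):
    """Forwards **kwargs blindly, so unknown keys reach Base.__init__."""

    def __init__(self, **kwargs):
        # 'distributed_sampler' is still in kwargs here -> TypeError in Base.
        super().__init__(**kwargs)


class ChildFixed(Base):
    """Names the kwarg explicitly, consuming it before forwarding."""

    def __init__(self, distributed_sampler=False, **kwargs):
        self.distributed_sampler = distributed_sampler
        super().__init__(**kwargs)


try:
    ChildBroken(batch_size=8, distributed_sampler=True)
except TypeError as e:
    print("broken:", e)  # unexpected keyword argument 'distributed_sampler'

dl = ChildFixed(batch_size=8, distributed_sampler=True)
print("fixed:", dl.distributed_sampler, dl.batch_size)
```

This mirrors the traceback above: ConcatDataLoader passes `**data_loader_kwargs` to `torch.utils.data.DataLoader.__init__`, which does not accept `distributed_sampler`, so any fix needs to consume that key (explicitly or by popping it) before calling `super().__init__`.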

Versions:

scvi-tools 1.1.2

@absudabsu absudabsu added the bug label Nov 18, 2024
@canergen (Member)
Hi, multi-GPU support is still experimental. We need to add tests that use multiple GPUs and test it more thoroughly. The timeline is Q1 2025, and it's planned to be released with 1.3.

@ori-kron-wis ori-kron-wis linked a pull request Nov 25, 2024 that will close this issue