RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly #18

Open
pascual-tejero opened this issue Aug 28, 2023 · 0 comments

Hello,

I get an error when running the vesselformer training script (train.py) on my university's GPU cluster. I tried reducing the batch size (from 50 to 2), but training still fails with the same error.

2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: - Run '<function main at 0x7f2cd2855670>' in 1 processes
2023-08-28 12:02:44,379 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
        {'batch_size': 50, 'shuffle': True, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2023-08-28 12:02:44,397 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
        {'batch_size': 50, 'shuffle': False, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
Current run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
Engine run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
2023-08-28 12:03:30,077 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
Traceback (most recent call last):
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 205, in <module>
    parallel.run(main, args)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/distributed/launcher.py", line 316, in run
    func(local_rank, *args, **kwargs)
  File "train.py", line 196, in main
    trainer.run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/trainer.py", line 56, in run
    super().run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/workflow.py", line 250, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 704, in run
    return self._internal_run()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 783, in _internal_run
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
    raise e
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
    self.state.batch = next(self._dataloader_iter)
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
slurmstepd: error: Detected 9 oom-kill event(s) in StepId=20449.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
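
The log above warns that 16 worker processes are requested while the suggested maximum on this system is 8, and slurmstepd reports that processes were killed by the cgroup out-of-memory handler. One thing that could be tried is capping `num_workers` at the number of CPUs actually available to the job. The following is only a minimal sketch under that assumption: `capped_num_workers` is a hypothetical helper, not part of vesselformer, PyTorch, or ignite, and whether it resolves the out-of-memory kills is untested.

```python
import os


def capped_num_workers(requested: int) -> int:
    """Return `requested`, capped at the CPUs available to this process.

    Hypothetical helper for illustration only.
    """
    if hasattr(os, "sched_getaffinity"):
        # On Linux this reflects the CPU set the scheduler gave this process,
        # which can be smaller than the machine's total core count.
        available = len(os.sched_getaffinity(0))
    else:
        # Fallback for platforms without sched_getaffinity.
        available = os.cpu_count() or 1
    return max(0, min(requested, available))


if __name__ == "__main__":
    # 16 is the value shown in the data loader kwargs in the log.
    print("num_workers to use:", capped_num_workers(16))
```

The returned value would be passed as `num_workers` wherever the data loaders are created. Setting it to 0 loads data in the main process, which is slower but can help confirm whether the worker processes are what exceeds the memory limit.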

Any help you can provide will be welcomed.
