Hello,

I get an error when trying to run the vesselformer code (train.py) on my university's GPU cluster. I tried reducing the batch size (from 50 to 2), but it still fails to train.
2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
2023-08-28 12:02:44,306 ignite.distributed.launcher.Parallel INFO: - Run '<function main at 0x7f2cd2855670>' in 1 processes
2023-08-28 12:02:44,379 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
{'batch_size': 50, 'shuffle': True, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
2023-08-28 12:02:44,397 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset '<dataset_vessel3d.ve':
{'batch_size': 50, 'shuffle': False, 'num_workers': 16, 'collate_fn': <function image_graph_collate at 0x7f2cd2a1b040>, 'pin_memory': True}
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
2023-08-28 12:02:46,254 ignite.distributed.auto.auto_model INFO: Apply torch DataParallel on model
Current run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
self.state.batch = next(self._dataloader_iter)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
success, data = self._try_get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
Engine run is terminating due to exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Exception: DataLoader worker (pid(s) 411948) exited unexpectedly
Traceback (most recent call last):
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
self._handle_exception(e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
raise e
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
self.state.batch = next(self._dataloader_iter)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
success, data = self._try_get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
2023-08-28 12:03:30,077 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
Traceback (most recent call last):
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 411948) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 205, in <module>
parallel.run(main, args)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/distributed/launcher.py", line 316, in run
func(local_rank, *args, **kwargs)
File "train.py", line 196, in main
trainer.run()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/trainer.py", line 56, in run
super().run()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/engines/workflow.py", line 250, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 704, in run
return self._internal_run()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 783, in _internal_run
self._handle_exception(e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
raise e
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 753, in _internal_run
time_taken = self._run_once_on_dataset()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 854, in _run_once_on_dataset
self._handle_exception(e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 464, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 421, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/monai/handlers/stats_handler.py", line 148, in exception_raised
raise e
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/ignite/engine/engine.py", line 807, in _run_once_on_dataset
self.state.batch = next(self._dataloader_iter)
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
success, data = self._try_get_data()
File "/home/guests/pascual_cervera/miniconda3/envs/lvs/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 411948) exited unexpectedly
slurmstepd: error: Detected 9 oom-kill event(s) in StepId=20449.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
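Two things stand out to me in the log: the loaders are created with num_workers=16 even though the cluster suggests at most 8, and the final slurmstepd message is a cgroup (host-memory) OOM kill, so the workers seem to be killed for CPU RAM rather than GPU memory, which might explain why lowering the batch size alone did not help. Below is a minimal sketch of the loader setup as I understand it from the kwargs in the log, just with a lower worker count; the dataset constructor name and import paths are my guesses, not the actual train.py code:

```python
# Sketch of the training loader as reported by auto_dataloader in the log above.
# Dataset constructor name and import paths are assumptions, not the real train.py code.
import ignite.distributed as idist

from dataset_vessel3d import build_vessel_data   # hypothetical constructor name
from utils import image_graph_collate            # collate_fn shown in the log; path assumed

train_ds = build_vessel_data(split="train")      # hypothetical call

train_loader = idist.auto_dataloader(
    train_ds,
    batch_size=2,                   # already reduced from 50
    shuffle=True,
    num_workers=4,                  # the warning says 16 exceeds the suggested max of 8
    collate_fn=image_graph_collate,
    pin_memory=True,
)
```

If there is a recommended value for num_workers (or a config flag that controls it), that would be good to know.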
Any help you can provide would be welcome.
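In the meantime, the sanity check I plan to try next is a single-process run with the workers disabled (again a sketch, reusing the assumed names from above):

```python
# Single-process sanity check: with num_workers=0 all loading happens in the main
# process, so a host-memory problem should surface there directly instead of as a
# silently killed worker.
from torch.utils.data import DataLoader

debug_loader = DataLoader(
    train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,                  # no worker processes
    collate_fn=image_graph_collate,
    pin_memory=False,               # skip pinned host memory while debugging
)

batch = next(iter(debug_loader))    # pull one batch and watch the job's memory usage
```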