torchrun --standalone --nproc_per_node=4 pretrain.py OR python -m torch.distributed.launch --nproc_per_node=1 pretrain.py
[2024-03-07 17:54:29,710] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
tokens per iteration will be: 32,768
breaks down as: 1 grad accum steps * 4 processes * 16 batch size * 512 max seq len
memmap:True train data.shape:(6936803, 512)
downloading finished.....
Initializing a new model from scratch
Traceback (most recent call last):
File "pretrain.py", line 239, in
torch.cuda.set_device(device)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda_init_.py", line 408, in set_device
Traceback (most recent call last):
File "pretrain.py", line 239, in
Traceback (most recent call last):
torch._C._cuda_setDevice(device)
File "pretrain.py", line 239, in
torch.cuda.set_device(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda_init.py", line 408, in set_device
torch.cuda.set_device(device)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda_init_.py", line 408, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
num decayed parameter tensors: 57, with 58,470,912 parameters
num non-decayed parameter tensors: 17, with 8,704 parameters
using fused AdamW: True
[2024-03-07 17:54:34,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal CTRL_C_EVENT
[2024-03-07 17:55:04,909] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGINT
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGTERM
Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
result = self._invoke_run(role)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 869, in _invoke_run
run_result = self._monitor_workers(self._worker_group)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 329, in _monit
or_workers
result = self._pcontext.wait(0)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 277, in wait
return self._poll()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 661, in _poll
self.close() # terminate all running procs
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 706, in _close
handler.proc.wait(time_to_wait)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1079, in wait
return self._wait(timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1357, in _wait
result = _winapi.WaitForSingleObject(self._handle,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_h
andler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1860 got signal: 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\Scripts\torchrun.exe_main.py", line 7, in
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\errors_init.py", line 347, in wrapper
return f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
result = agent.run()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 734, in run
self._shutdown(e.sigval)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 311, in _shutd
own
self._pcontext.close(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 699, in _close
handler.close(death_sig=death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 582, in close
self.proc.send_signal(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
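The `invalid device ordinal` errors above come from the `--nproc_per_node=4` launch: each rank calls `torch.cuda.set_device()` on its local rank index, and any rank whose index does not correspond to a real GPU on this machine fails (three of the four ranks crash here, which is consistent with only one visible GPU). A minimal sketch of a guard that could sit before the `set_device` call around line 239 of pretrain.py; the variable names below are assumptions for illustration, not the repo's actual code:

```python
import os
import torch

# LOCAL_RANK is set by torchrun for each worker process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
n_gpus = torch.cuda.device_count()

# Fail early with a clear message instead of "invalid device ordinal".
if local_rank >= n_gpus:
    raise RuntimeError(
        f"LOCAL_RANK={local_rank} but only {n_gpus} CUDA device(s) visible; "
        f"relaunch with --nproc_per_node={max(n_gpus, 1)} or fewer."
    )

device = f"cuda:{local_rank}"  # assumed to mirror how pretrain.py builds its device string
torch.cuda.set_device(device)
```

On a single-GPU machine the simplest workaround is to launch with `torchrun --standalone --nproc_per_node=1 pretrain.py`. The later `ValueError: Unsupported signal: 2` is a separate Windows limitation: after the workers die, the elastic agent tries to forward SIGINT (signal 2) to them, but Python's `subprocess.Popen.send_signal` does not support SIGINT on Windows, so the shutdown path itself also raises.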