
Could you tell me which part of the configuration is wrong to cause this error? #62

Open

beginner-wj opened this issue Mar 7, 2024 · 1 comment

Comments

@beginner-wj

torchrun --standalone --nproc_per_node=4 pretrain.py OR python -m torch.distributed.launch --nproc_per_node=1 pretrain.py
[2024-03-07 17:54:29,710] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
tokens per iteration will be: 32,768
breaks down as: 1 grad accum steps * 4 processes * 16 batch size * 512 max seq len
memmap:True train data.shape:(6936803, 512)
downloading finished.....
Initializing a new model from scratch
Traceback (most recent call last):
File "pretrain.py", line 239, in <module>
torch.cuda.set_device(device)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda\__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(The same traceback is printed, interleaved, by the other failing worker processes.)

num decayed parameter tensors: 57, with 58,470,912 parameters
num non-decayed parameter tensors: 17, with 8,704 parameters
using fused AdamW: True
[2024-03-07 17:54:34,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal CTRL_C_EVENT
[2024-03-07 17:55:04,909] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGINT
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGTERM
Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
result = self._invoke_run(role)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 869, in _invoke_run
run_result = self._monitor_workers(self._worker_group)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 329, in _monit
or_workers
result = self._pcontext.wait(0)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 277, in wait
return self._poll()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 661, in _poll
self.close() # terminate all running procs
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 706, in _close
handler.proc.wait(time_to_wait)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1079, in wait
return self._wait(timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1357, in _wait
result = _winapi.WaitForSingleObject(self._handle,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_h
andler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1860 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
result = agent.run()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 734, in run
self._shutdown(e.sigval)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 311, in _shutd
own
self._pcontext.close(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 699, in _close
handler.close(death_sig=death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 582, in close
self.proc.send_signal(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
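
For context: "invalid device ordinal" means torch.cuda.set_device was given a GPU index that does not exist on this machine. The command above starts four workers (--nproc_per_node=4), and each worker sets a CUDA device (pretrain.py line 239), typically from its local rank, so with fewer than four GPUs the extra ranks fail. A minimal check, as a standalone snippet (not part of pretrain.py):

import torch

# --nproc_per_node must not exceed this count; each spawned rank calls
# torch.cuda.set_device(local_rank), and any rank >= device_count()
# raises "CUDA error: invalid device ordinal".
print("visible CUDA devices:", torch.cuda.device_count())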

@xiaoyao16

xiaoyao16 commented Dec 26, 2024

In pretrain.py, at line 2, add:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
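
A minimal sketch of that suggestion (the exact line number and import order in pretrain.py may differ; the key point is that the variable must be set before anything initializes CUDA):

import os
# Expose only GPU 0 to this process and its children; this must run
# before torch initializes CUDA to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch  # imported after the env var so only one device is visible

With a single visible GPU, also launch a single worker, e.g. torchrun --standalone --nproc_per_node=1 pretrain.py; otherwise the additional ranks will still hit the same "invalid device ordinal" error.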
