
Could you tell me which part of the configuration is wrong to cause this error? #62

Open

beginner-wj opened this issue Mar 7, 2024 · 1 comment

Comments

@beginner-wj

torchrun --standalone --nproc_per_node=4 pretrain.py OR python -m torch.distributed.launch --nproc_per_node=1 pretrain.py
[2024-03-07 17:54:29,710] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING]
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-07 17:54:29,722] torch.distributed.run: [WARNING] *****************************************
tokens per iteration will be: 32,768
breaks down as: 1 grad accum steps * 4 processes * 16 batch size * 512 max seq len
memmap:True train data.shape:(6936803, 512)
downloading finished.....
Initializing a new model from scratch
Traceback (most recent call last):
File "pretrain.py", line 239, in <module>
torch.cuda.set_device(device)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\cuda\__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(The same traceback is printed, interleaved, by the other failing worker processes.)

num decayed parameter tensors: 57, with 58,470,912 parameters
num non-decayed parameter tensors: 17, with 8,704 parameters
using fused AdamW: True
[2024-03-07 17:54:34,906] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal CTRL_C_EVENT
[2024-03-07 17:55:04,909] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGINT
[2024-03-07 17:55:04,909] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2124 closing signal SIGTERM
Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 727, in run
result = self._invoke_run(role)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 869, in _invoke_run
run_result = self._monitor_workers(self._worker_group)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 329, in _monit
or_workers
result = self._pcontext.wait(0)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 277, in wait
return self._poll()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 661, in _poll
self.close() # terminate all running procs
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 706, in _close
handler.proc.wait(time_to_wait)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1079, in wait
return self._wait(timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1357, in _wait
result = _winapi.WaitForSingleObject(self._handle,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 62, in _terminate_process_h
andler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1860 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
result = agent.run()
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 734, in run
self._shutdown(e.sigval)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\agent\server\local_elastic_agent.py", line 311, in _shutd
own
self._pcontext.close(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 318, in close
self._close(death_sig=death_sig, timeout=timeout)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 699, in _close
handler.close(death_sig=death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\elastic\multiprocessing\api.py", line 582, in close
self.proc.send_signal(death_sig)
File "C:\Users\123\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
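
For context: "invalid device ordinal" means torch.cuda.set_device was given a GPU index that does not exist on this machine. The command above starts four workers (--nproc_per_node=4), and each worker sets a CUDA device (pretrain.py line 239), typically from its local rank, so with fewer than four GPUs the extra ranks fail. A minimal check, as a standalone snippet (not part of pretrain.py):

import torch

# --nproc_per_node must not exceed this count; each spawned rank calls
# torch.cuda.set_device(local_rank), and any rank >= device_count()
# raises "CUDA error: invalid device ordinal".
print("visible CUDA devices:", torch.cuda.device_count())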

@xiaoyao16

xiaoyao16 commented Dec 26, 2024

In pretrain.py, at line 2, add:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
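
A minimal sketch of that suggestion (the exact line number and import order in pretrain.py may differ; the key point is that the variable must be set before anything initializes CUDA):

import os
# Expose only GPU 0 to this process and its children; this must run
# before torch initializes CUDA to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch  # imported after the env var so only one device is visible

With a single visible GPU, also launch a single worker, e.g. torchrun --standalone --nproc_per_node=1 pretrain.py; otherwise the additional ranks will still hit the same "invalid device ordinal" error.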
