
Update DDP-script.py for Windows #375

Closed
wants to merge 1 commit

Conversation

ngbrown
Contributor

@ngbrown ngbrown commented Sep 29, 2024

Appendix A code

  • Disable libuv because PyTorch isn't built with it on Windows, and it is mainly useful for high node counts (see the sketch below).
  • Add a comment explaining what Gloo is.
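
For reference, a minimal sketch of the change described above (not the exact diff; the USE_LIBUV environment variable, which PyTorch's env:// rendezvous reads, is the assumed mechanism, and the address/port values are illustrative):

import os
from torch.distributed import init_process_group

def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"  # illustrative port
    # Disable libuv: the PyTorch Windows wheels are not built with it,
    # and it mainly benefits jobs with high node counts.
    os.environ["USE_LIBUV"] = "0"
    # Gloo is a CPU-based collective-communication backend that,
    # unlike NCCL, also works on Windows.
    init_process_group(backend="gloo", rank=rank, world_size=world_size)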

@rasbt
Owner

rasbt commented Sep 29, 2024

Hi there, and thanks for the PR! Could you share a bit more information about the issue on Windows? Does it currently result in a warning, an error, or slower performance? I'm just trying to understand this a bit better and learn what's going on.

@rasbt
Owner

rasbt commented Sep 29, 2024

Could you kindly double-check that the following checkbox is set?

[Screenshot of the "Allow edits by maintainers" checkbox]

@ngbrown
Contributor Author

ngbrown commented Sep 29, 2024

When I don't disable libuv, I get the following output and a process ending error:

RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Full output:
> python .\appendix-A\01_main-chapter-code\DDP-script.py
PyTorch version: 2.4.1+cu121
CUDA available: True
Number of GPUs available: 1
Traceback (most recent call last):
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 181, in <module>
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 238, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 76, in _wrap
    fn(i, *args)
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 116, in main
    ddp_setup(rank, world_size)  # NEW: initialize process groups
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 37, in ddp_setup
    init_process_group(backend="gloo", rank=rank, world_size=world_size)
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Also, I don't have the "Allow edits by maintainers" option because I organize most forks into an organization instead of my own account.

I updated the branch to remove the extra space in the comment.

@rasbt rasbt mentioned this pull request Sep 29, 2024
@rasbt
Owner

rasbt commented Sep 29, 2024

Thanks for explaining (I am mainly curious because I don't have a Windows machine where I could test this). I made some adjustments so that this only gets applied on Windows machines specifically. I tried to push it to this branch, but it didn't work, so I pushed it here in #376 (it includes your commit):

[Screenshot of the adjusted code, 2024-09-29]
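
In outline, the Windows-only guard presumably amounts to something like this (a sketch, assuming the USE_LIBUV environment-variable approach; see #376 for the actual commit):

import os
import platform

# Only the Windows wheels of PyTorch lack libuv support, so disable it
# there (before init_process_group) and leave other platforms unchanged.
if platform.system() == "Windows":
    os.environ["USE_LIBUV"] = "0"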

@ngbrown
Contributor Author

ngbrown commented Sep 29, 2024

Sounds good.

@ngbrown ngbrown closed this Sep 29, 2024
@d-kleine
Contributor

@ngbrown Which PyTorch version did you use? RVC-Boss/GPT-SoVITS#1357 (comment)

I had no issues with this notebook using 2.4 and 2.5dev on Windows 10/11.

@ngbrown
Contributor Author

ngbrown commented Sep 30, 2024

@d-kleine I have torch==2.4.1+cu121 installed on a Windows 10 laptop with an NVIDIA Quadro M1000M GPU.

Before installing any other packages, I installed PyTorch (per the PyTorch "Get Started" page) with:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

To run this script, that was all I installed, but even with all the other items from requirements.txt installed, I still get the same error when libuv is not disabled.

My NVIDIA GPU is old enough that I can't use CUDA 12.4, so I use the CUDA 12.1 build.

From your comment and the one you linked to, could it be that you have the CPU version installed and CUDA is unavailable for you?
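
For example, this one-liner prints the installed build (the version suffix, e.g. +cpu vs. +cu121) and whether CUDA is visible:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"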

@d-kleine
Contributor

Hm, thanks - okay, then idk either 🙃
