
Update DDP-script.py for Windows #375

Closed
wants to merge 1 commit

Conversation

ngbrown
Contributor

@ngbrown ngbrown commented Sep 29, 2024

Appendix A code

  • Disable libuv because PyTorch isn't built with it on Windows, and it is mainly useful for high node counts (see the sketch below).
  • Add a comment explaining what Gloo is.
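
For reference, a minimal sketch of the change described above (not the exact diff; the USE_LIBUV environment variable, which PyTorch's env:// rendezvous reads, is the assumed mechanism, and the address/port values are illustrative):

import os
from torch.distributed import init_process_group

def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"  # illustrative port
    # Disable libuv: the PyTorch Windows wheels are not built with it,
    # and it mainly benefits jobs with high node counts.
    os.environ["USE_LIBUV"] = "0"
    # Gloo is a CPU-based collective-communication backend that,
    # unlike NCCL, also works on Windows.
    init_process_group(backend="gloo", rank=rank, world_size=world_size)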

@rasbt
Owner

rasbt commented Sep 29, 2024

Hi there, and thanks for the PR! Could you share a bit more information about the issue on Windows? Does it currently result in a warning, an error, or slower performance? I'm just trying to understand this a bit better and learn what's going on.

@rasbt
Owner

rasbt commented Sep 29, 2024

Could you kindly double-check that the following checkbox is set?

[Screenshot of the "Allow edits by maintainers" checkbox]

@ngbrown
Contributor Author

ngbrown commented Sep 29, 2024

When I don't disable libuv, I get the following output and a process ending error:

RuntimeError: use_libuv was requested but PyTorch was build without libuv support
Full output:
> python .\appendix-A\01_main-chapter-code\DDP-script.py
PyTorch version: 2.4.1+cu121
CUDA available: True
Number of GPUs available: 1
Traceback (most recent call last):
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 181, in <module>
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 238, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\multiprocessing\spawn.py", line 76, in _wrap
    fn(i, *args)
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 116, in main
    ddp_setup(rank, world_size)  # NEW: initialize process groups
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\appendix-A\01_main-chapter-code\DDP-script.py", line 37, in ddp_setup
    init_process_group(backend="gloo", rank=rank, world_size=world_size)
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\dev\OSS\LLMs-from-scratch\.venv\Lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Also, I don't have the "Allow edits by maintainers" option because I organize most forks into an organization instead of my own account.

I updated the branch to remove the extra space in the comment.

@rasbt rasbt mentioned this pull request Sep 29, 2024
@rasbt
Owner

rasbt commented Sep 29, 2024

Thanks for explaining (I am mainly curious because I don't have a Windows machine where I could test this). I made some adjustments so that this only gets applied on Windows machines specifically. I tried to push it to this branch, but it didn't work, so I pushed it here in #376 (it includes your commit):

[Screenshot of the adjusted code, 2024-09-29]
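
In outline, the Windows-only guard presumably amounts to something like this (a sketch, assuming the USE_LIBUV environment-variable approach; see #376 for the actual commit):

import os
import platform

# Only the Windows wheels of PyTorch lack libuv support, so disable it
# there (before init_process_group) and leave other platforms unchanged.
if platform.system() == "Windows":
    os.environ["USE_LIBUV"] = "0"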

@ngbrown
Contributor Author

ngbrown commented Sep 29, 2024

Sounds good.

@ngbrown ngbrown closed this Sep 29, 2024
@d-kleine
Contributor

@ngbrown Which PyTorch version did you use? RVC-Boss/GPT-SoVITS#1357 (comment)

I had no issues with this notebook using 2.4 and 2.5dev on Windows 10/11.

@ngbrown
Contributor Author

ngbrown commented Sep 30, 2024

@d-kleine I have torch==2.4.1+cu121 installed on a Windows 10 laptop with an NVIDIA Quadro M1000M GPU.

Before installing any other packages, I installed PyTorch (per the PyTorch "Get Started" page) with:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

To run this script, that was all I installed, but even with all the other items from requirements.txt installed, I still get the same error when libuv is not disabled.

My NVIDIA GPU is old enough that I can't use CUDA 12.4, so I use the CUDA 12.1 build.

From your comment and the one you linked to, could it be that you have the CPU version installed and CUDA is unavailable for you?
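
For example, this one-liner prints the installed build (the version suffix, e.g. +cpu vs. +cu121) and whether CUDA is visible:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"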

@d-kleine
Contributor

Hm, thanks - okay, then idk either 🙃
