[distributed] add PG APIs and general doc cleanups (pytorch#140853)
Doc updates:

* This adds documentation for the object-oriented ProcessGroup APIs that are being used in torchft as well as pytorch/rfcs#71.
* It also does some general cleanups to simplify distributed.rst by using `:members:`.
* It adds `__init__` definitions for the Stores.
* I've reordered things so the collective APIs come before the Store/PG APIs.

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: pytorch#140853
Approved by: https://github.com/kwen2501
d4l3k authored and fmo-mt committed Dec 11, 2024
1 parent b69375a commit 95f7628
Showing 2 changed files with 201 additions and 72 deletions.
61 changes: 35 additions & 26 deletions docs/source/distributed.rst
@@ -329,32 +329,6 @@ a github issue or RFC if this is a use case that's blocking you.

--------------------------------------------------------------------------------

Distributed Key-Value Store
---------------------------

The distributed package comes with a distributed key-value store, which can be
used to share information between processes in the group as well as to
initialize the distributed package in
:func:`torch.distributed.init_process_group` (by explicitly creating the store
as an alternative to specifying ``init_method``.) There are 3 choices for
Key-Value Stores: :class:`~torch.distributed.TCPStore`,
:class:`~torch.distributed.FileStore`, and :class:`~torch.distributed.HashStore`.

.. autoclass:: Store
.. autoclass:: TCPStore
.. autoclass:: HashStore
.. autoclass:: FileStore
.. autoclass:: PrefixStore

.. autofunction:: torch.distributed.Store.set
.. autofunction:: torch.distributed.Store.get
.. autofunction:: torch.distributed.Store.add
.. autofunction:: torch.distributed.Store.compare_set
.. autofunction:: torch.distributed.Store.wait
.. autofunction:: torch.distributed.Store.num_keys
.. autofunction:: torch.distributed.Store.delete_key
.. autofunction:: torch.distributed.Store.set_timeout

Groups
------

@@ -386,6 +360,7 @@ distributed process group easily. :func:`~torch.distributed.device_mesh.init_dev
used to create new DeviceMesh, with a mesh shape describing the device topology.

.. autoclass:: torch.distributed.device_mesh.DeviceMesh
    :members:

Point-to-point communication
----------------------------
@@ -506,6 +481,7 @@ Collective functions
.. autofunction:: monitored_barrier

.. autoclass:: Work
    :members:

.. autoclass:: ReduceOp

@@ -516,6 +492,39 @@ Collective functions

:class:`~torch.distributed.ReduceOp` is recommended to use instead.


Distributed Key-Value Store
---------------------------

The distributed package comes with a distributed key-value store, which can be
used to share information between processes in the group as well as to
initialize the distributed package in
:func:`torch.distributed.init_process_group` (by explicitly creating the store
as an alternative to specifying ``init_method``.) There are 3 choices for
Key-Value Stores: :class:`~torch.distributed.TCPStore`,
:class:`~torch.distributed.FileStore`, and :class:`~torch.distributed.HashStore`.

.. autoclass:: Store
    :members:
    :special-members:

.. autoclass:: TCPStore
    :members:
    :special-members: __init__

.. autoclass:: HashStore
    :members:
    :special-members: __init__

.. autoclass:: FileStore
    :members:
    :special-members: __init__

.. autoclass:: PrefixStore
    :members:
    :special-members: __init__


Profiling Collective Communication
-----------------------------------------
