RFC-0042-torch-distributed-redesign #71
base: master
Conversation
Signed-off-by: youkaichao <[email protected]>
For a preview, please check https://github.com/youkaichao/rfcs/blob/master/RFC-0042-torch-distributed-redesign.md
An important use case for this is dynamic prefill/decode disaggregation: prefill instances and decode instances dynamically join the group according to the workload, and they send/recv KV caches to/from each other. There are other solutions, like using etcd for communicating metadata and directly using device communication libraries such as our own NCCL wrapper. That means completely dropping
The current global group is necessary for control-plane operations over the cluster. It conflates the notion of a cluster with that of communication groups, so it would be great to separate the two. One aspect of making this feasible is whether it's possible to implement
Do you mean we'd have a stateless version of the process group?
Thanks for posting this RFC! I want to see if we can make changes to the existing torch.distributed APIs first to solve some or all of your problems, and then, if needed, we can consider a new set of APIs (e.g. torch.distributed2). For the src/dst in send/recv, that is something that has been bugging us for a while, and I suppose we could fix it in the existing APIs without worrying about BC by simply adding new kwargs. For the global group, I think this might be harder to solve, but I'd like to get a document started with the different possibilities and their pros/cons. cc @kwen2501
I think a lot of what's being asked here can be done with just a new entrypoint (rather than just init_process_group) and avoid having to create a new package. That's largely what I'm doing in the torchft ProcessGroups: just initializing the underlying PG without setting the global state. It is definitely a bit clunky (since it operates on the store API), but it generally works just fine to instantiate a PG without calling it, i.e. in current PyTorch you can do:

```python
from torch.distributed import PrefixStore, ProcessGroupNCCL, TCPStore

store = TCPStore(
    host_name=host,
    port=int(port),
    is_master=False,
    wait_for_workers=False,
)
store = PrefixStore("my_custom_pg", store)
pg = ProcessGroupNCCL(store, rank=10, world_size=32)
pg.rank(), pg.size()
```

This can be used in a completely object-oriented way without relying on any "internal" APIs.
@youkaichao would you be happy to use the workflow @d4l3k proposed? Or is there still something missing? @d4l3k, is the PrefixStore needed so that each store can use a default UUID (does each store use UUID 0 or something)? I wonder if we should still provide a little bit of a helper here: (a) we could allow reusing the globally initialized TCPStore if it exists (or accept one as an optional kwarg as an alternative), and (b) we could deal with the UUID automatically somehow, ensuring that each PG still gets a unique UUID.
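A minimal sketch of what such a helper could look like (the name `next_pg_prefix` and the key scheme are assumptions for illustration, not an existing torch.distributed API): hand out a unique key prefix per process group so several groups can share one TCPStore, via PrefixStore, without key collisions.

```python
import itertools

# Hypothetical helper, not a real torch.distributed API. A process-local
# counter only yields matching prefixes across ranks if every rank creates
# its process groups in the same order, so that the n-th group created on
# each rank gets the same prefix.
_pg_counter = itertools.count()

def next_pg_prefix(tag: str = "pg") -> str:
    """Return prefixes 'pg:0', 'pg:1', ... in group-creation order."""
    return f"{tag}:{next(_pg_counter)}"
```

Each rank would then wrap the shared rendezvous store as `PrefixStore(next_pg_prefix(), store)` before constructing its ProcessGroup, instead of hard-coding a string like `"my_custom_pg"`.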
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) must be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. [ghstack-poisoned]
@d4l3k that's a great idea. I actually tried it before. However, the problem is that you cannot use
I'm also exploring an idea of using the TCP store to directly implement a new set of send/recv/broadcast operations, in https://github.com/vllm-project/vllm/blob/377b74fe877c7eb4632c2ca0778b9da9a5db8ae6/vllm/distributed/utils.py#L127 . It works locally, but sometimes hangs during initialization in CI, though.
@youkaichao is the dst/src mapping issue the only reason that send/recv do not work? I started to prototype a possible fix for that today; I'll share it here shortly. Send/recv via TCPStore feels like it would require polling and become unscalable at large numbers of ranks, but for certain use cases it could work. We have also been thinking about better support for control-plane communication. cc @c-p-i-o
For send/recv, yes, kind of. There are other, more complicated cases, though. For example, and they are quite difficult to use if I have a standalone group that is not part of the global group.
For the TCP store (and any "store"), shouldn't it have polling by default? I don't see any polling in the example code at https://pytorch.org/docs/stable/distributed.html#torch.distributed.TCPStore .
That would be great.
OK, these look like the same thing to me. Basically, if we added support for 'group_src' and 'group_dst' to all our APIs, wherever there is currently a 'src' or 'dst', it would fix the issue. That's what it looks like to me, at least.
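To make the proposed kwargs concrete, here is a sketch of the argument canonicalization such APIs would need (the helper name `resolve_group_dst` and its exact error behavior are assumptions, not the actual PyTorch implementation): accept either the legacy global-rank `dst` or a new group-local `group_dst`, and resolve both to a group-local rank.

```python
def resolve_group_dst(group_ranks, dst=None, group_dst=None):
    """Resolve a destination to a group-local rank.

    group_ranks: list mapping group-local rank -> global rank.
    dst:         legacy argument, a global rank.
    group_dst:   proposed new kwarg, a group-local rank.

    Exactly one of dst / group_dst must be given, so existing callers keep
    working unchanged while new callers can avoid global ranks entirely
    (important for standalone groups that have no global ranks at all).
    """
    if (dst is None) == (group_dst is None):
        raise ValueError("exactly one of 'dst' and 'group_dst' is required")
    if group_dst is not None:
        if not 0 <= group_dst < len(group_ranks):
            raise ValueError(f"group_dst {group_dst} out of range")
        return group_dst
    try:
        return group_ranks.index(dst)  # translate global -> group-local
    except ValueError:
        raise ValueError(f"global rank {dst} is not in this group") from None
```

A standalone ProcessGroup would simply never pass `dst`, so no global-rank mapping is ever consulted.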
Well, I'm not sure what you mean about polling by default. But if I were to build send/recv on top of TCPStore, I think my two choices would be: (1) naive, make the recv op 'synchronous' on the CPU and rely on the TCP timeout; (2) implement a new polling thread on the recv side that keeps checking whether send-data has been posted. I was referring to path (2). I'm not sure if (1) is actually practical for performance reasons, but we could check.
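To make option (2) concrete, here is a minimal sketch using a dict-backed stand-in for TCPStore (the class `DictStore`, the function names, and the key scheme are all assumptions for illustration): the sender posts a payload under a unique key, and the receiver polls until the key appears or a deadline passes. The real TCPStore also exposes a blocking `wait` on keys, which is closer in spirit to option (1).

```python
import threading
import time

class DictStore:
    """Thread-safe dict standing in for a KV store such as TCPStore."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

def store_send(store, src, dst, seq, payload):
    # The sender only posts data; the (src, dst, seq) triple keys each message.
    store.set(f"send/{src}/{dst}/{seq}", payload)

def store_recv(store, src, dst, seq, timeout=5.0, poll_interval=0.001):
    # Option (2): a polling receive loop with a deadline.
    key = f"send/{src}/{dst}/{seq}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = store.get(key)
        if payload is not None:
            return payload
        time.sleep(poll_interval)
    raise TimeoutError(f"no sender posted {key} within {timeout}s")
```

This also illustrates the scalability concern raised above: every pending recv burns its own polling loop against the store, which is unlikely to scale to large numbers of ranks.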
For my use case, (1) is enough.
We also need to take care of collective ops like allreduce and allgather. The goal is to support subgroups working on their own without any dependency on the global group.
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) must be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. ghstack-source-id: 72264a21bf53bafd0b16b7cbb961aa91cc9b5992 Pull Request resolved: #140460
@youkaichao we don't document the ProcessGroup object APIs (I'm not sure why not; we really should), but if you use them directly it should work as expected for send/recv/broadcast, as the ranks are PG-local rather than global.
Doc updates: * This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as pytorch/rfcs#71 . * It also does some general cleanups to simplify the distributed.rst by using `:methods`. * It adds `__init__` definitions for the Stores * I've reordered things so the collective APIs are before the Store/PG apis Test plan: ``` lintrunner -a cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going ``` Pull Request resolved: pytorch#140853 Approved by: https://github.com/kwen2501
Changes the semantics of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also adds mypy annotations. Partially addresses RFC 0042 (pytorch/rfcs#71). See more details/motivation in #140460 cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o [ghstack-poisoned]
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 Pull Request resolved: #141054 Approved by: https://github.com/kwen2501
) Also add missing mypy typing and a few asserts to make mypy happy. Partially addresses RFC 0042 (pytorch/rfcs#71). See more details/motivation in pytorch#140460. Note: the object collective version canonicalizes to global instead of group rank, simply because this left more of the original code intact and required fewer conversions overall. Pull Request resolved: pytorch#140827 Approved by: https://github.com/kwen2501
Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Pull Request resolved: pytorch#140843 Approved by: https://github.com/kwen2501
…orch#140847) Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Pull Request resolved: pytorch#140847 Approved by: https://github.com/H-Huang ghstack dependencies: pytorch#140843
Partly addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) must be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. Pull Request resolved: pytorch#140460 Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/fduwjj