Allgather with DID loop split #3284

Merged: 23 commits from wjy/comm into main on Dec 9, 2024
Conversation

wujingyue (Collaborator) commented Oct 25, 2024

Another baby step towards #2563

@wujingyue force-pushed the wjy/comm branch 2 times, most recently from 3c5c68e to 5ef8ff4 on November 18, 2024
@wujingyue changed the base branch from main to wjy/forward on November 27, 2024
@wujingyue changed the title from "Fix communication lowering to support DID loop parallelization" to "Allgather with DID loop split" on November 27, 2024
@wujingyue marked this pull request as ready for review on November 28, 2024
wujingyue (Collaborator, Author) commented:

!test

Base automatically changed from wjy/forward to main November 30, 2024 05:28
wujingyue (Collaborator, Author) commented:

!test

wujingyue (Collaborator, Author) commented:

cc @xwang233 the testing infra appears to be problematic for H100: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/125617985

samnordmann (Collaborator) left a comment:

Thank you for this PR! I am still trying to fully understand how the logic works, but let me post a series of minor comments in the meantime.


  const auto iter = std::find(
      tv->getLogicalDomain().begin(), tv->getLogicalDomain().end(), inputs[0]);
  NVF_ERROR(
Collaborator commented:

I am not sure I understand why this check is needed. Isn't it true that, by assumption, what is returned by getInputsTo is an element of tv->getLogicalDomain()?

Collaborator commented:

Also, I am not sure what is meant by "dominate" in the error message.

wujingyue (Collaborator, Author) commented:

Re "dominate": see https://en.wikipedia.org/wiki/Dominator_(graph_theory); I extended the concept to a set of nodes dominating another set.

Re the check: I heard from @naoyam that logical won't always dominate allocation with "the new indexing system".

@@ -196,7 +196,7 @@ void lowerToReduceScatter(
     std::vector<Communication*>& comms) {
   const DeviceMesh& mesh = input_tv->getDeviceMesh();
   auto reduction_axis = output_tv->getReductionAxis().value();
-  auto scattered_axis = getShardedAxis(output_tv);
+  auto scattered_axis = getShardedAxis(output_tv, ParallelType::DIDx);
Collaborator commented:

OK, however, if the sharded dimension is split, then scattered_axis is not valid here, right?

wujingyue (Collaborator, Author) commented:

I can't think of an immediate problem, and #3504 apparently works fine. It could be incidental, though, so I'm happy to hear what you think is problematic.

Collaborator commented:

Correct me if I'm wrong, but I think this is an example where we see the problem:

  d = num_devices;

  tv0: [i{i0}, i{i1}]     // i0 has extent d
  tv1 = sum(tv0, axis=0); // tv1: [r{i0}, i{i1}]

  tv0->axis(0)->parallelize(ParallelType::DIDx);
  tv1->split(1, d);       // [r{i0}, i{i1/d}, i{d}]
  tv1->axis(2)->parallelize(ParallelType::DIDx);

In this case, the scattered axis is 2 but getShardedAxis returns 1.

wujingyue (Collaborator, Author) commented on Dec 6, 2024:

In your case,

tv0:
  logical: [iDID{i0}, i{i1}]
tv1:
  logical: [r{i0}, i{i1}]
  allocation: [r{i0}, i{i1/d}, iDID{d}]

getShardedLogicalAxis will return 0, the tensor axis being sharded. This is correct because the output at::Tensor for tv1 will be of shape [i1/d] and indeed axis 0 is the sharded dimension. Then, scattered_axis=0 will be used to compute which input tensor axis will be sharded (which will be 1). Finally, that input scattered axis (1) will be used to split the input tensor of shape [1, i1].

Caveat: with 7cf2384, DID'ing an inner split is explicitly disallowed, so the above case will actually throw an exception. But what I said should be correct after we lift that limitation.
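
To make the shapes concrete, here is a rough sketch using plain ATen calls; the function, variable names, and the use of chunk are illustrative assumptions, not the actual lowering or test code:

  #include <ATen/ATen.h>

  // Sketch only: d devices; tv0 has logical shape [d, i1] and is sharded on axis 0,
  // so its per-device shard is [1, i1]. tv1 = sum(tv0, 0) with i1 inner-split by d
  // and the inner axis DID'ed, so its per-device output is [i1 / d].
  void shapeSketch(int64_t d, int64_t i1, int64_t my_rank) {
    at::Tensor in_local = at::randn({1, i1});
    // scattered_axis = 0 on the output maps to axis 1 of the input,
    // which is then chunked across the d devices.
    at::Tensor in_chunk = in_local.chunk(d, /*dim=*/1)[my_rank]; // [1, i1 / d]
    at::Tensor out_local = at::empty({i1 / d});                  // [i1 / d]
    (void)in_chunk;
    (void)out_local;
  }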

  at::Tensor unsharded_tensor =
      at::randn({num_devices * kTensorSize}, at::kFloat);
  at::Tensor in_tensor =
      shardTensor(unsharded_tensor, in).to(communicator_->device());

samnordmann (Collaborator) commented on Dec 3, 2024:

Suggested change:

  std::vector<int64_t> ref_in_tensor_shape = {kTensorSize};
  EXPECT_EQ(in_tensor.sizes(), ref_in_tensor_shape);

samnordmann (Collaborator) commented on Dec 3, 2024:

I don't understand how shardTensor can be correct here if it never replays the split backwards... But I might be missing something.

wujingyue (Collaborator, Author) commented:

Thanks for the review! I think there are two problems with the PR as is:

  1. shardTensor may slice the wrong numbers. For example, if an inner split is DID'ed, the slicing needs to be strided per the outer split (see the sketch after this list).
  2. nvFuser doesn't error out when an allgather is not along the outermost allocated dimension. This used to be guaranteed by ReorderShardedAxisPass via its isInnerResharding check. However, getShardingChanges, one of its dependencies, hasn't been updated to read loop/allocation:

     auto rootmap = PairwiseLogicalDomainMap(input, output).mapBroadcast(false);
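
A minimal sketch of problem 1 using plain ATen; the function and names are assumptions for illustration, not nvFuser's shardTensor:

  #include <ATen/ATen.h>

  // Sketch only: a logical axis of extent n is inner-split into [n/d, iDID{d}].
  // Device r then owns every d-th element starting at r, so a contiguous chunk
  // (which would be right for an outer split) is the wrong slice.
  at::Tensor shardInnerSplit(const at::Tensor& unsharded, int64_t d, int64_t r) {
    // Strided slicing: start at r, step by d, along dim 0.
    return unsharded.slice(/*dim=*/0, /*start=*/r, /*end=*/c10::nullopt, /*step=*/d);
  }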

wujingyue (Collaborator, Author) commented:

Re the suggested change: I manually checked the shape is as expected. I added some extra unit tests for shardTensor alone, so we don't have to verify it here.

wujingyue (Collaborator, Author) commented:

I made a couple of changes to address the problems I mentioned in #3284 (comment).

  1. 7cf2384. It's overkill but will probably be OK for quite some time. I had a hard time finding a concrete use case that has to mix DID and host ID within one logical dimension. I agree that to properly support inner splits we'll need to "replay the split backwards". It's not a trivial change, so I'll postpone it to a separate PR.
  2. I wrote #3531 (Harden assertBuffersHaveSameSize to check shapes) to harden runtime checks for allgather, and added one more allgather test (Allgather_LoopSplit_Noncontiguous) to this PR; see the rough sketch after this list for the kind of sharding it exercises. These extra checks will fire when we hit the most common limitations, until ReorderShardedAxisPass is properly fixed, which will take several decent-size PRs.
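
For context, a rough sketch of the kind of DID loop split an allgather test sets up; the helper names and exact calls are assumptions based on nvFuser's multidevice API and test utilities (fusion and d are assumed to be in scope), not the actual test:

  // Sketch only: shard a 1-D input across d devices via a loop split,
  // then copy it to an unsharded output, which lowers to an allgather.
  TensorView* in = makeContigTensor(1);
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  const DeviceMesh mesh = DeviceMesh::createForNumDevices(d);
  in->setDeviceMesh(mesh);
  out->setDeviceMesh(mesh);

  // Outer-split the sole logical axis by d and parallelize the outer split on DIDx.
  in->split(0, d, /*inner_split=*/false);
  in->axis(0)->parallelize(ParallelType::DIDx);
  in->setAllocationDomain(in->getLoopDomain(), true);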

wujingyue (Collaborator, Author) commented:

I had a hard time finding a concrete use case that has to mix DID and host ID within one logical dimension.

In fact, there is one:

  // A has shape (S, sharded(D), M/(S*D), K)

So I'll try to file a feature request after this PR.

  at::Tensor out_tensor = fec.runFusionWithInputs({in_tensor})[0];
  assertIsCompiledToHostIrContainer(fec);

  EXPECT_TRUE(at::equal(out_tensor.cpu(), unsharded_tensor));
Collaborator commented:

Why not use validate here?

Collaborator commented:

I noticed allgather's lowering was not changed... I'm a bit surprised it didn't need any modifications for inputs with DID loop split! I might have missed a few earlier PRs, though.

Collaborator commented:

Why not use validate here?

Since validate allows for (small) differences, if two tensors are supposed to be exactly the same, the simpler check, i.e. at::equal, is preferable.

wujingyue (Collaborator, Author) commented:

I'm a bit surprised it didn't need any modifications for inputs with DID loop split!

Whether we call lowerToAllgather depends on the I/O meshes and whether the I/O is sharded:

  lowerToAllgather(input_tv, output_tv, comms);

isSharded has been reading the allocation domain since #3444.

That being said, I think this PR as is is a bit too permissive and may lower a set to an allgather without properly checking its allocation domain. For example,

  auto rootmap = PairwiseLogicalDomainMap(input, output).mapBroadcast(false);

reads root and logical and needs to be updated. I'll try to fix that.

wujingyue (Collaborator, Author) commented:

That being said, I think this PR as is is a bit too permissive and may lower a set to an allgather without properly checking its allocation domain.

I tried to address this in #3284 (comment).

samnordmann (Collaborator) commented:

Thanks for the review! I think there are two problems with the PR as is:

  1. shardTensor may slice the wrong numbers. For example, if an inner split is DID'ed, the slicing needs to be strided per the outer split.
  2. nvFuser doesn't error out when an allgather is not along the outermost allocated dimension. This used to be guaranteed by ReorderShardedAxisPass via its isInnerResharding check. However, getShardingChanges, one of its dependencies, hasn't been updated to read loop/allocation:

     auto rootmap = PairwiseLogicalDomainMap(input, output).mapBroadcast(false);

I see, thanks for the clarifications. Is it still ready for review or would you like to fix the points you mentioned above?

On my side, I understand why the PR works as is, but I am a little bit concerned because, in my opinion, it relies on a weak contract. To be more precise, the tests only work here because, when splitting the innermost loop axis and sharding one split, the allgather still operates on the innermost dimension (index 0), which has the same index as the logical axis used to produce that loop axis. The fact that this works feels a bit incidental and not robust; in particular, it would likely fail for variants like merging axes, transposition (even of non-sharded axes), multidimensional parallelization (e.g. DIDy), etc. Wdyt?

wujingyue (Collaborator, Author) commented:

Is it still ready for review or would you like to fix the points you mentioned above?

I'll try to fix these points. I'd like nvFuser to at least error out on cases it doesn't support. This way, when we trigger a known limitation in the future, it'll show up as an exception instead of silently generating wrong numbers.

@wujingyue wujingyue requested a review from naoyam December 5, 2024 17:39
wujingyue added a commit that referenced this pull request Dec 6, 2024
I wrote this to make the allgather-related issue discovered in #3284 (comment) easier to expose. And it seems like a good extra runtime check to have, because `_allgather_base` treats I/O tensors as flat buffers and ignores the shapes.
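
A minimal sketch of the kind of buffer-size check that commit describes; the function name and messages are assumptions, not nvFuser's actual assertBuffersHaveSameSize:

  #include <ATen/ATen.h>
  #include <vector>

  // Sketch only: because _allgather_base treats buffers as flat memory, an explicit
  // size check surfaces mismatches with a clearer error before the collective runs.
  void checkBuffersHaveSameNumel(
      const std::vector<at::Tensor>& ins,
      const std::vector<at::Tensor>& outs) {
    for (const at::Tensor& t : ins) {
      TORCH_CHECK(
          t.numel() == ins.front().numel(),
          "All input buffers are expected to have the same number of elements.");
    }
    for (const at::Tensor& t : outs) {
      TORCH_CHECK(
          t.numel() == outs.front().numel(),
          "All output buffers are expected to have the same number of elements.");
    }
  }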
naoyam (Collaborator) left a comment:

LGTM

wujingyue (Collaborator, Author) commented:

!test

@wujingyue wujingyue merged commit 4a897a4 into main Dec 9, 2024
34 of 35 checks passed
@wujingyue wujingyue deleted the wjy/comm branch December 9, 2024 23:30