ReduceScatter with DID loop split #3504
Conversation
@pytest.mark.mpi
def test_allreduce(mpi_test):
This allreduce test is merely for DID logical split. I don't think allreduce can support DID loop split, because sum's reduction axes can only be logical. But I'd be happy to know otherwise.
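For concreteness, here is a minimal sketch (C++ API) of what I mean by DID on a logical axis for an allreduce. This is not the test itself; the helper name makeContigTensor (from the C++ test utils), the shape, and d are assumptions for illustration:
// Sketch only: assumes we're inside a fusion definition (FusionGuard active)
// and the usual C++ test helpers; d = number of devices.
TensorView* in = makeContigTensor(2);   // logical shape [d, n]
fusion->addInput(in);
TensorView* out = sum(in, {0});         // reduce the device-sharded logical axis
fusion->addOutput(out);

DeviceMesh mesh = DeviceMesh::createForNumDevices(d);
in->setDeviceMesh(mesh);
out->setDeviceMesh(mesh);
in->axis(0)->parallelize(ParallelType::DIDx);  // DIDx directly on a logical axis, no loop split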
> This allreduce test is merely for DID logical split.
Just to be clear, you meant DID parallelization of logical domains, right? I'm not sure what you meant by DID logical split otherwise.
Assuming I understand what you meant correctly, I think this is where TensorView::rFactor could be used. That's what we use for intra-device hierarchical reductions. For example, I'd think that for multi-GPU reductions, we would have something like the following (I'm mixing the C++ and Python APIs):
self.out->split(0, num_devices, /*inner=*/false);
auto intermediate_result = self.out->rFactor({1});
intermediate_result->axis(0)->parallelize(DIDx);
self.out->axis(0)->parallelize(DIDx);
Here, intermediate_result
would be the partial result of per-device reduction, which would be then reduced between all the devices and saved to self.out
.
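To put that snippet in context, a fuller sketch in the C++ API could look like this (again just a sketch; the input shape, num_devices, and the surrounding fusion setup are assumptions):
TensorView* in = makeContigTensor(1);                 // logical shape [d * n]
fusion->addInput(in);
TensorView* out = sum(in, {0});                       // full reduction
fusion->addOutput(out);

out->split(0, num_devices, /*inner_split=*/false);    // loop domain becomes [d, n]
TensorView* intermediate_result = out->rFactor({1});  // per-device partial sum over n
intermediate_result->axis(0)->parallelize(ParallelType::DIDx);
out->axis(0)->parallelize(ParallelType::DIDx);        // reduce the d partials across devices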
I agree it's something like rFactor, and I did look into how TensorView::rFactor works in Fuser/tests/cpp/test_tutorial.cpp (line 344 at ecabd46):
TensorView* tv2 = tv1_copy->rFactor({0});
If we want to loop (but not logical) split an allreduce, the input would have a logical shape like [D*2, 3] and the output would have a logical shape like [2, 3]. Regardless of scheduling, what op in fusion IR could do that? (Not a sum, because that reduces an entire dimension to 1.)
Let's talk offline. It seems we are not using the same vocabulary (e.g., I don't understand what "loop split" and "logical split" mean).
#3543 is my failed attempt. It triggered an assertion at Fuser/csrc/scheduler/vectorize_helper.cpp (line 1063 at 9346c8f):
NVF_THROW("Unexpected producer RF ID: ", producer_rf_id->toString())
Anyhow, this isn't a blocker. As we discussed yesterday, we'll probably stick with logical split for reductions in Allreduce and ReduceScatter due to MatmulOp's implementation.
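For reference, the logical-split flavor for ReduceScatter would look roughly like this (a sketch, not necessarily what this PR generates; shapes, helper names, and d are assumptions), where the reduced axis and the scattered axis are both already logical axes:
TensorView* in = makeContigTensor(3);   // logical shape [d, d, n]
fusion->addInput(in);
TensorView* out = sum(in, {0});         // reduce axis 0; out keeps [r(d), d, n]
fusion->addOutput(out);

DeviceMesh mesh = DeviceMesh::createForNumDevices(d);
in->setDeviceMesh(mesh);
out->setDeviceMesh(mesh);
in->axis(0)->parallelize(ParallelType::DIDx);   // sharded reduction axis
out->axis(1)->parallelize(ParallelType::DIDx);  // scatter the output across devices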
!test
LGTM
!test
!test
!test
!build
For #2563
Tested: http://nv/eoZ