[WIP] Stride MatmulOp according to set allocation domain #3447
Conversation
Force-pushed from df4e235 to b4bd2bf.
Force-pushed from b4bd2bf to 2dee366.
!test
!test
-  return {at::matmul(a, b)};
+  auto matmul_out = at::matmul(a, b);
+  if (out()->hasAllocation()) {
+    auto matmul_sizes = matmul_out.sizes().vec();
How about changing computeStrides to take c10::IntArrayRef? .vec() copies.
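For illustration, a sketch of the suggested change; the TensorView* parameter and overall shape are assumed from the call site computeStrides(out(), matmul_sizes):

```cpp
// Hypothetical signature per the suggestion: c10::IntArrayRef is a non-owning
// view over the size data, so no copy is made when passing it.
std::vector<int64_t> computeStrides(
    TensorView* out,
    c10::IntArrayRef sizes);

// The call site can then pass matmul_out.sizes() directly instead of
// materializing a std::vector via .vec():
auto strides = computeStrides(out(), matmul_out.sizes());
```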
if (out()->hasAllocation()) {
  auto matmul_sizes = matmul_out.sizes().vec();
  auto strides = computeStrides(out(), matmul_sizes);
  matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
This feels wrong. IIUC, as_strided creates a view of the input. So if the input tensor is in the wrong memory format there's no way to "relayout" the storage using as_strided. I think you would have caught this problem if your test had verified the content (not just the shape) of the output tensor.
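A minimal ATen sketch of the point, with made-up values: as_strided reinterprets the same storage with new strides, so it cannot relayout the underlying bytes.

```cpp
#include <ATen/ATen.h>

// t is contiguous (row-major): strides {3, 1}.
at::Tensor t = at::arange(6, at::kFloat).reshape({2, 3});
// Viewing it with column-major strides {1, 2} changes how the buffer is
// read, not where the bytes live.
at::Tensor v = at::as_strided(t, {2, 3}, {1, 2});
// The view shares t's storage; nothing was copied or relaid.
TORCH_CHECK(v.data_ptr() == t.data_ptr());
```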
Agreed! You need an explicit copy here instead.
There may be a better way than a copy. I suspect matmul_out respects the output strides, but please double-check.
Good call. If we can feed our own outputs there, we might as well do that. So you need to call as_strided on the output before executing the kernel.
BTW, ATen may or may not do that in a fused kernel, i.e. https://github.com/pytorch/pytorch/blob/1f3d8896bc9cea7f46c50ff92b69c6aa139defcb/aten/src/ATen/native/LinearAlgebra.cpp#L2097-L2099
I think the more interesting question is: do we assume it's always a copy, and should we try to expose this transpose/copy kernel in the fusion for nvfuser to handle instead... 🤔
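A sketch of the pre-strided-output approach, with made-up shapes; at::empty_strided and at::matmul_out are the relevant ATen entry points:

```cpp
#include <ATen/ATen.h>

at::Tensor a = at::randn({4, 8});
at::Tensor b = at::randn({8, 16});
// Allocate the output with the desired (allocation-domain) strides up front;
// column-major here, purely for illustration.
at::Tensor out = at::empty_strided({4, 16}, {1, 4}, a.options());
// matmul_out writes into the given buffer; whether the layout comes from a
// fused kernel or an internal copy is up to ATen (see the link above).
at::matmul_out(out, a, b);
```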
> I think the more interesting question is: do we assume it's always a copy, and should we try to expose this transpose/copy kernel in the fusion for nvfuser to handle instead...

It's hard to be performance-optimal given that the implementation of at::matmul_out can change without notifying us. I'd check for the cases we care about; I suspect at::matmul_out is able to fuse for all or 90% of them. If so, I'd unconditionally assume it fuses, for simplicity.
Ahh you're right. I misunderstood this function call -- there is no copy here. Will fix this to use matmul_out or an explicit copy.
  auto strides = computeStrides(out(), matmul_sizes);
  matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
}
inferAndValidateAllocationSizesAndStrides(matmul_out, out(), ee);
I'm not sure about validating output allocation for all MatmulOps.
- We already validate allocation sizes/strides for each segment's inputs and outputs. Given MatmulOp currently forms its own segment, existing validation seems enough.
- If/when MatmulOp produces an internal tensor, we can't always materialize the tensor as an at::Tensor that matches its allocation domain. For example, the allocation domain can be a split and/or a swizzle of the logical domain. Assuming allocation is a permutation of logical is probably OK for segment inputs/outputs, but can be too limiting for internal tensors. cc @zasdfgbnm
@@ -757,4 +757,8 @@ std::vector<IterDomain*> strideOrderToAllocation(
     const std::vector<IterDomain*>& logical_domain,
     const std::vector<int64_t>& stride_order);

+std::vector<int64_t> computeStrides(
Looks like we forgot this declaration?
out = fd.execute(inputs)
verify_stride_order(out[0].stride(), perm)
# Verify that setting the stride order does not change the logical shape
self.assertEqual(out[0].shape, torch.Size([b, b, m, n]))
You forgot to validate the result here. That's probably why the issue @wujingyue raised didn't get caught.
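For illustration, the missing value check sketched in ATen terms (the test itself is Python, where comparing against torch.matmul via torch.testing.assert_close would play the same role):

```cpp
// Validate values, not just shapes/strides; a and b stand in for the
// fusion's input tensors (names are illustrative).
at::Tensor ref = at::matmul(a, b);
TORCH_CHECK(at::allclose(matmul_out, ref), "MatmulOp produced wrong values");
```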
Resolves Issue #2427. Restrides the output of MatmulOp::evaluate according to the stride order set from the python frontend (fd.ops.add_output / fd.ops.stride_order).