
[WIP] Stride MatmulOp according to set allocation domain #3447

Draft · wants to merge 12 commits into main

Conversation

@Priya2698 (Collaborator) commented Nov 19, 2024:

Resolves Issue #2427.
Restrides the output of MatmulOp::evaluate according to the stride order set from the Python frontend (fd.ops.add_output / fd.ops.stride_order).

@Priya2698 Priya2698 force-pushed the pm/matmul_stride_order branch 2 times, most recently from df4e235 to b4bd2bf Compare November 28, 2024 16:25
@Priya2698 (Collaborator, Author):

!test

@Priya2698 (Collaborator, Author):

!test

@Priya2698 Priya2698 marked this pull request as ready for review December 2, 2024 17:06
@Priya2698 Priya2698 requested review from jjsjann123 and wujingyue and removed request for jjsjann123 December 2, 2024 17:06
return {at::matmul(a, b)};
auto matmul_out = at::matmul(a, b);
if (out()->hasAllocation()) {
auto matmul_sizes = matmul_out.sizes().vec();
Collaborator:

How about changing computeStrides to take c10::IntArrayRef? .vec() copies.
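For illustration, here is a minimal Python sketch of what a stride computation like computeStrides might do, assuming stride_order[i] gives the stride rank of dimension i with rank 0 innermost (the convention and function shape here are assumptions for illustration, not the PR's actual code):

```python
def compute_strides(sizes, stride_order):
    """Compute element strides for the given sizes and stride order.

    stride_order[i] = rank of dimension i's stride; rank 0 is the
    innermost dimension (stride 1), rank ndim-1 the outermost.
    """
    ndim = len(sizes)
    strides = [0] * ndim
    stride = 1
    # Walk ranks from innermost (0) outward, assigning growing strides.
    for rank in range(ndim):
        dim = stride_order.index(rank)
        strides[dim] = stride
        stride *= sizes[dim]
    return strides

print(compute_strides([2, 3, 4], [2, 1, 0]))  # row-major: [12, 4, 1]
print(compute_strides([2, 3, 4], [0, 1, 2]))  # dim 0 innermost: [1, 2, 6]
```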

if (out()->hasAllocation()) {
auto matmul_sizes = matmul_out.sizes().vec();
auto strides = computeStrides(out(), matmul_sizes);
matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
@wujingyue (Collaborator) commented Dec 2, 2024:

This feels wrong. IIUC, as_strided creates a view of the input. So if the input tensor is in the wrong memory format there's no way to "relayout" the storage using as_strided. I think you would have caught this problem if your test had verified the content (not just the shape) of the output tensor.
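The reviewer's point can be illustrated without ATen: reinterpreting an existing buffer with new strides (which is all as_strided does) changes the logical values, whereas a real relayout must copy. A minimal pure-Python sketch:

```python
def read_strided(buf, sizes, strides):
    """Read a 2-D logical view of a flat buffer via explicit strides."""
    return [[buf[i * strides[0] + j * strides[1]]
             for j in range(sizes[1])]
            for i in range(sizes[0])]

# A 2x3 result stored row-major (strides (3, 1)); logically [[0,1,2],[3,4,5]].
buf = [0, 1, 2, 3, 4, 5]
row_major = read_strided(buf, (2, 3), (3, 1))

# "as_strided" to column-major strides (1, 2) only reinterprets the same
# storage -- the logical content changes, which is exactly the bug:
reinterpreted = read_strided(buf, (2, 3), (1, 2))
print(row_major)      # [[0, 1, 2], [3, 4, 5]]
print(reinterpreted)  # [[0, 2, 4], [1, 3, 5]]  -- wrong values!

# A correct relayout copies element (i, j) into the new layout's offset:
col_buf = [0] * 6
for i in range(2):
    for j in range(3):
        col_buf[i * 1 + j * 2] = row_major[i][j]
assert read_strided(col_buf, (2, 3), (1, 2)) == row_major
```

A test that only checks the output's shape and strides passes on the reinterpreted view even though its contents are wrong, which is why verifying values matters.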

Collaborator:

Agreed! You need an explicit copy here instead.

Collaborator:

There may be a better way than a copy. I suspect matmul_out respects the output strides, but please double-check.

Collaborator:

Good call. If we can feed our own outputs there, we might as well do that. So you need to call as_strided on the output before executing the kernel.

BTW, aten may or may not do that in a fused kernel. i.e. https://github.com/pytorch/pytorch/blob/1f3d8896bc9cea7f46c50ff92b69c6aa139defcb/aten/src/ATen/native/LinearAlgebra.cpp#L2097-L2099

I think the more interesting question is, do we assume it's always a copy and should we try to expose this transpose/copy kernel in fusion for nvfuser to handle it instead... 🤔

Collaborator:

> I think the more interesting question is, do we assume it's always a copy and should we try to expose this transpose/copy kernel in fusion for nvfuser to handle it instead...

It's hard to be performance optimal given that the implementation of at::matmul_out can change without notifying us. I'd check the cases that we care about. I suspect that for all or 90% of them at::matmul_out is able to fuse. If so, I'd unconditionally assume it'll fuse, for simplicity.
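The "feed our own outputs" idea can be sketched with numpy, whose np.matmul also accepts a preallocated out= array. Here a column-major output stands in for a tensor pre-strided to a permuted allocation domain (numpy is a stand-in for at::matmul_out, not the PR's actual code):

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(12, dtype=np.float64).reshape(3, 4)

# Allocate the output with the desired (here column-major) layout first,
# then ask the matmul to write directly into it -- analogous to calling
# as_strided on the output tensor and passing it to at::matmul_out.
out = np.empty((2, 4), dtype=np.float64, order="F")
np.matmul(a, b, out=out)

assert out.flags["F_CONTIGUOUS"]        # layout preserved
assert np.array_equal(out, a @ b)       # values correct
```

Whether the underlying kernel writes the permuted layout directly or materializes an intermediate and copies is an implementation detail of the library, which is the reviewers' point about at::matmul_out above.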

@Priya2698 (Collaborator, Author):

Ahh you're right. I misunderstood this function call -- there is no copy here. Will fix this to use matmul_out or an explicit copy.

auto strides = computeStrides(out(), matmul_sizes);
matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
}
inferAndValidateAllocationSizesAndStrides(matmul_out, out(), ee);
Collaborator:

I'm not sure about validating output allocation for all MatmulOps.

  1. We already validate allocation sizes/strides for each segment's inputs and outputs. Given MatmulOp currently forms its own segment, existing validation seems enough.
  2. If/When MatmulOp produces an internal tensor, we can't always materialize the tensor as an at::Tensor that matches its allocation domain. For example, the allocation domain can be a split and/or a swizzle of logical. Assuming allocation is a permutation of logical is probably OK for segment inputs/outputs, but can be too limiting for internal tensors. cc @zasdfgbnm

@@ -757,4 +757,8 @@ std::vector<IterDomain*> strideOrderToAllocation(
const std::vector<IterDomain*>& logical_domain,
const std::vector<int64_t>& stride_order);

std::vector<int64_t> computeStrides(
Collaborator:

Looks like we forgot this declaration?

out = fd.execute(inputs)
verify_stride_order(out[0].stride(), perm)
# Verify that setting the stride order does not change the logical shape
self.assertEqual(out[0].shape, torch.Size([b, b, m, n]))
Collaborator:

You forgot to validate the result here. That's probably why the issue @wujingyue raised didn't get caught.
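The missing value check can be sketched with a numpy stand-in (the arrays here are hypothetical placeholders, not the test's actual fd.execute output):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3))
b = rng.standard_normal((3, 4))

# Pretend this is the fusion output: correct values, permuted layout.
out = np.asfortranarray(a @ b)

ref = a @ b
# Shape and stride-order checks alone pass even when values are wrong;
# compare contents against a reference computation as well.
assert out.shape == ref.shape
assert out.strides != ref.strides   # layout really is permuted
assert np.allclose(out, ref)        # values match the reference
```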


@Priya2698 Priya2698 marked this pull request as draft December 3, 2024 09:09
@Priya2698 Priya2698 changed the title Stride MatmulOp according to set allocation domain [WIP] Stride MatmulOp according to set allocation domain Dec 3, 2024