[WIP] Stride MatmulOp according to set allocation domain #3447
Conversation
Force-pushed from df4e235 to b4bd2bf.
Force-pushed from b4bd2bf to 2dee366.
!test
!test
-  return {at::matmul(a, b)};
+  auto matmul_out = at::matmul(a, b);
+  if (out()->hasAllocation()) {
+    auto matmul_sizes = matmul_out.sizes().vec();
How about changing computeStrides to take c10::IntArrayRef? .vec() copies.
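For illustration, a sketch of the suggested change; the TensorView* parameter and overall shape are assumed from the call site computeStrides(out(), matmul_sizes):

```cpp
// Hypothetical signature per the suggestion: c10::IntArrayRef is a non-owning
// view over the size data, so no copy is made when passing it.
std::vector<int64_t> computeStrides(
    TensorView* out,
    c10::IntArrayRef sizes);

// The call site can then pass matmul_out.sizes() directly instead of
// materializing a std::vector via .vec():
auto strides = computeStrides(out(), matmul_out.sizes());
```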
if (out()->hasAllocation()) {
  auto matmul_sizes = matmul_out.sizes().vec();
  auto strides = computeStrides(out(), matmul_sizes);
  matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
This feels wrong. IIUC, as_strided creates a view of the input. So if the input tensor is in the wrong memory format there's no way to "relayout" the storage using as_strided. I think you would have caught this problem if your test had verified the content (not just the shape) of the output tensor.
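A minimal ATen sketch of the point, with made-up values: as_strided reinterprets the same storage with new strides, so it cannot relayout the underlying bytes.

```cpp
#include <ATen/ATen.h>

// t is contiguous (row-major): strides {3, 1}.
at::Tensor t = at::arange(6, at::kFloat).reshape({2, 3});
// Viewing it with column-major strides {1, 2} changes how the buffer is
// read, not where the bytes live.
at::Tensor v = at::as_strided(t, {2, 3}, {1, 2});
// The view shares t's storage; nothing was copied or relaid.
TORCH_CHECK(v.data_ptr() == t.data_ptr());
```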
Agreed! You need an explicit copy here instead.
There may be a better way than a copy. I suspect matmul_out respects the output strides, but please double-check.
Good call. If we can feed our own outputs there, we might as well do that. So you need to call as_strided on the output before executing the kernel.
BTW, ATen may or may not do that in a fused kernel, i.e. https://github.com/pytorch/pytorch/blob/1f3d8896bc9cea7f46c50ff92b69c6aa139defcb/aten/src/ATen/native/LinearAlgebra.cpp#L2097-L2099
I think the more interesting question is: do we assume it's always a copy, and should we try to expose this transpose/copy kernel in the fusion for nvfuser to handle instead... 🤔
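A sketch of the pre-strided-output approach, with made-up shapes; at::empty_strided and at::matmul_out are the relevant ATen entry points:

```cpp
#include <ATen/ATen.h>

at::Tensor a = at::randn({4, 8});
at::Tensor b = at::randn({8, 16});
// Allocate the output with the desired (allocation-domain) strides up front;
// column-major here, purely for illustration.
at::Tensor out = at::empty_strided({4, 16}, {1, 4}, a.options());
// matmul_out writes into the given buffer; whether the layout comes from a
// fused kernel or an internal copy is up to ATen (see the link above).
at::matmul_out(out, a, b);
```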
> I think the more interesting question is: do we assume it's always a copy, and should we try to expose this transpose/copy kernel in the fusion for nvfuser to handle instead...

It's hard to be performance-optimal given that the implementation of at::matmul_out can change without notifying us. I'd check for the cases we care about; I suspect at::matmul_out is able to fuse for all or 90% of them. If so, I'd unconditionally assume it fuses, for simplicity.
Ahh you're right. I misunderstood this function call -- there is no copy here. Will fix this to use matmul_out or an explicit copy.
  auto strides = computeStrides(out(), matmul_sizes);
  matmul_out = at::as_strided(matmul_out, matmul_sizes, strides);
}
inferAndValidateAllocationSizesAndStrides(matmul_out, out(), ee);
I'm not sure about validating output allocation for all MatmulOps.
- We already validate allocation sizes/strides for each segment's inputs and outputs. Given MatmulOp currently forms its own segment, existing validation seems enough.
- If/when MatmulOp produces an internal tensor, we can't always materialize the tensor as an at::Tensor that matches its allocation domain. For example, the allocation domain can be a split and/or a swizzle of the logical domain. Assuming allocation is a permutation of logical is probably OK for segment inputs/outputs, but can be too limiting for internal tensors. cc @zasdfgbnm
@@ -757,4 +757,8 @@ std::vector<IterDomain*> strideOrderToAllocation(
     const std::vector<IterDomain*>& logical_domain,
     const std::vector<int64_t>& stride_order);

+std::vector<int64_t> computeStrides(
Looks like we forgot this declaration?
out = fd.execute(inputs)
verify_stride_order(out[0].stride(), perm)
# Verify that setting the stride order does not change the logical shape
self.assertEqual(out[0].shape, torch.Size([b, b, m, n]))
You forgot to validate the result here. That's probably why the issue @wujingyue raised didn't get caught.
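For illustration, the missing value check sketched in ATen terms (the test itself is Python, where comparing against torch.matmul via torch.testing.assert_close would play the same role):

```cpp
// Validate values, not just shapes/strides; a and b stand in for the
// fusion's input tensors (names are illustrative).
at::Tensor ref = at::matmul(a, b);
TORCH_CHECK(at::allclose(matmul_out, ref), "MatmulOp produced wrong values");
```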
Resolves Issue #2427. Restrides the output of MatmulOp::evaluate according to the stride order set from the python frontend (fd.ops.add_output / fd.ops.stride_order).