Use DotOp layout for UpcastMXFPOp Lowering #3057

Open
LiyangLingIntel wants to merge 6 commits into main from liyang/upcast_dot_layout
Conversation

@LiyangLingIntel (Contributor) commented Dec 20, 2024

This pull request supports dot layout codegen for the upcast_mxfp operation, which can be more efficient than the previous blocked layout implementation.

The two skipped tests fail with an L0 runtime error; they will be addressed in a separate PR, #2968.

@LiyangLingIntel linked an issue Dec 20, 2024 that may be closed by this pull request
@LiyangLingIntel self-assigned this Dec 20, 2024
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from f03d182 to 151eee4 on January 3, 2025 05:52
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from 151eee4 to 8b0f018 on January 3, 2025 05:56
@LiyangLingIntel changed the title from "[WIP] Use DotOp layout for UpcastMXFPOp Lowering" to "Use DotOp layout for UpcastMXFPOp Lowering" on Jan 3, 2025
@LiyangLingIntel marked this pull request as ready for review on January 3, 2025 05:56
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout [[ARG0]] : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout [[ARG1]] : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
@LiyangLingIntel (Contributor, Author) commented:

For E2M1, two fp4 elements are packed into one int8, so when upcasting, e.g., <32x32xi8> to <32x64xbf16>, the output dot layout must give each thread contiguous access to twice as many elements as the input dot layout does.

The common code achieves this through kWidth, e.g., changing 4 -> 8 for e2m1 to bf16. For the Intel GPU DPAS layout, we can instead change OpsPerChannel from 2 to 4 to meet the requirement, and then convert back with a ConvertLayoutOp.
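
To make the packing concrete, here is a minimal host-side sketch (plain Python, not this PR's lowering code) of how one packed i8 expands into two adjacent bf16-representable values. The nibble ordering, the e8m0 scale formula, and the helper names are illustrative assumptions.

```python
# Minimal sketch: decode two e2m1 (fp4) values packed in one i8.
# The magnitude table is the standard MX e2m1 value set; the low-nibble-first
# ordering and the e8m0 scale factor 2**(scale - 127) are assumptions.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit e2m1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def upcast_packed_byte(byte: int, e8m0_scale: int):
    """One i8 holds two fp4 elements; both share the same e8m0 scale."""
    scale = 2.0 ** (e8m0_scale - 127)
    lo = decode_e2m1(byte & 0x0F) * scale          # assumed first element
    hi = decode_e2m1((byte >> 4) & 0x0F) * scale   # assumed second element
    return lo, hi

# 0x2C packs -2.0 (low nibble 0xC) and 1.0 (high nibble 0x2); with scale 127
# (factor 1.0) one input byte expands to two adjacent output values.
assert upcast_packed_byte(0x2C, 127) == (-2.0, 1.0)
```

Each input byte owned by a thread must expand to two adjacent output elements owned by the same thread, which is exactly what the opsPerChan = 4 output layout below provides.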

Input tensor layout:

#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 16], B = [16, 16], C = [8, 16]}>, kWidth = 2}>
[ T0:0, T1:0, ... T14:0, T15:0, T0:8, T1:8, ... T14:8, T15:8]
[ T0:1, T1:1, ... T14:1, T15:1, T0:9, T1:9, ... T14:9, T15:9]
...
[ T0:7, T1:7, ... T14:7, T15:7, T0:15, T1:15, ... T14:15, T15:15]
[ T16:0, T17:0, ... T30:0, T31:0, T16:8, T17:8, ... T30:8, T31:8]
...
[ T16:7, T17:7, ... T30:7, T31:7, T16:15, T17:15, ... T30:15, T31:15]
[ T32:0, T33:0, ... T46:0, T47:0, T32:8, T33:8, ... T46:8, T47:8]
...
[ T48:0, T49:0, ... T62:0, T63:0, T48:8, T49:8, ... T62:8, T63:8]
...

Output tensor layout:

#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 4, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 32], B = [32, 16], C = [8, 16]}>, kWidth = 4}>
[ T0:0, T0:1, ... T15:0, T15:1, T0:16, T0:17, ... T15:16, T15:17]
[ T0:2, T0:3, ... T15:2, T15:3, T0:18, T0:19, ... T15:18, T15:19]
...
[ T0:14, T0:15, ... T15:14, T15:15, T0:30, T0:31, ... T15:30, T15:31]
[ T16:0, T16:1, ... T31:0, T31:1, T16:16, T16:17, ... T31:16, T31:17]
...
[ T16:14, T16:15, ... T31:14, T31:15, T16:30, T16:31, ... T31:30, T31:31]
[ T32:0, T32:1, ... T47:0, T47:1, T32:16, T32:17, ... T47:16, T47:17]
...
[ T48:0, T48:1, ... T63:0, T63:1, T48:16, T48:17, ... T63:16, T63:17]
...

// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout %arg0 : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout %arg1 : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
@chengjunlu (Contributor) commented Jan 3, 2025:

There is still a ttg.convert_layout?

Can we directly output the result type of ttg.upcast_mxfp as tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, so that we can omit the layout conversion operation?

@LiyangLingIntel (Contributor, Author) replied:


No, please see my explanation above.
We have to use i8 with OpsPerChan=2 and unpack to bf16 with OpsPerChan=4, then use ttg.convert_layout to convert the bf16 tensor's dot layout (OpsPerChan=4) back to the operand A dot layout (OpsPerChan=2). Otherwise, the two unpacked fp4 values would be placed at incorrect positions.

Another Contributor commented:


Would these conversions be preserved after running -tritonintelgpu-remove-layout-conversions?

@chengjunlu (Contributor) commented:


A possible way to shuffle the values from #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 4}> to #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}> without SLM (see the sketch after this list):

  1. Bitcast <2xbfloat16> to uint32.
  2. Use inline vISA to reinterpret the uint32 as <2xbfloat16> without register movement.
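
As a host-side illustration only (NumPy bit tricks, not vISA or the actual codegen): a <2xbfloat16> pair and a uint32 occupy the same 32 bits, so moving between the two views can in principle be a pure reinterpretation with no data movement. The element ordering assumed here is the low half first.

```python
import numpy as np

# Illustration only: a <2 x bf16> pair and a uint32 share the same 32 bits.
# bf16 is emulated by truncating float32 to its upper 16 bits.
pair_f32 = np.array([1.5, -2.0], dtype=np.float32)
pair_bf16_bits = (pair_f32.view(np.uint32) >> np.uint32(16)).astype(np.uint16)

# "Bitcast <2 x bf16> to uint32" (element 0 assumed in the low half).
packed = np.uint32(pair_bf16_bits[0]) | (np.uint32(pair_bf16_bits[1]) << np.uint32(16))

# Reinterpreting the uint32 back yields the same bf16 bit patterns.
unpacked = np.array([packed & np.uint32(0xFFFF), packed >> np.uint32(16)], dtype=np.uint16)
assert (unpacked == pair_bf16_bits).all()
```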

@LiyangLingIntel (Contributor, Author) replied:


Would these conversions be preserved after running -tritonintelgpu-remove-layout-conversions?

Yes. Until the solution chengjun mentioned is applied, I think this ttg.convert_layout is necessary to make the result correct.

Another Contributor commented:


IMO we can keep the layout conversion in this PR and, in a follow-up PR, improve the codegen to avoid SLM by using inline vISA, a new builtin, or some other approach.

// CHECK: [[B:%.*]] = tt.fp_to_fp [[CVT_ARG0]] : tensor<64x128xf8E4M3FN, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[D:%.*]] = tt.dot [[A]], [[B]], [[C]] : tensor<32x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<32x128xf32, [[DPAS]]>
// CHECK: [[RES:%.*]] = ttg.convert_layout [[D]] : tensor<32x128xf32, [[DPAS]]> -> tensor<32x128xf32, [[BLOCKED4]]>
// CHECK: scf.yield [[RES]] : tensor<32x128xf32, [[BLOCKED4]]>
@LiyangLingIntel (Contributor, Author) commented:


Here we transpose the dot_scaled operands before lowering, converting the RHS UpcastMXFP into an LHS UpcastMXFP, instead of implementing RHS UpcastMXFP directly.

In the RHS-scaling case with a dot layout, each thread accesses elements along a column, which would require scale values from threads across warps, but we can only shuffle values between threads within the same warp. So I kept the same logic as upstream.
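
A small NumPy sketch (not Triton code) of the identity being relied on: because (A · B)^T = B^T · A^T, a dot whose RHS operand needs upcasting can be computed as a transposed dot whose LHS operand needs upcasting, and then transposed back.

```python
import numpy as np

# Sketch of the transpose trick: an RHS-upcast dot is rewritten as an
# LHS-upcast dot on the transposed problem. `b_upcast` stands in for the
# result of upcast_mxfp applied to the scaled RHS operand.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 64)).astype(np.float32)
b_upcast = rng.standard_normal((64, 128)).astype(np.float32)

direct = a @ b_upcast                  # original dot_scaled semantics (RHS upcast)
via_transpose = (b_upcast.T @ a.T).T   # the upcast operand is now the LHS of the dot
assert np.allclose(direct, via_transpose)
```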

@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from 8b0f018 to 585f4e6 on January 3, 2025 06:58
@LiyangLingIntel force-pushed the liyang/upcast_dot_layout branch from 585f4e6 to 0fd0510 on January 3, 2025 14:09
Successfully merging this pull request may close these issues.

[Performance] Enhance tritongpu.upcast_mxfp with dot layout