Use DotOp layout for UpcastMXFPOp Lowering #3057
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout [[ARG0]] : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout [[ARG1]] : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
For E2M1, two fp4 elements are packed into one int8, so when upcasting e.g. <32x32xi8> to <32x64xbf16>, each thread in the output dot layout must access twice as many contiguous elements as in the input dot layout. The common code expresses this with kWidth, e.g. changing 4 -> 8 for e2m1 to bf16. For the Intel GPU DPAS layout, we can change OpsPerChannel from 2 to 4 to meet the requirement and then convert back with a ConvertLayoutOp. A sketch of the packing follows the layout diagrams below.
Input tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 16], B = [16, 16], C = [8, 16]}>, kWidth = 2}>
[ T0:0, T1:0, ... T14:0, T15:0, T0:8, T1:8, ... T14:8, T15:8]
[ T0:1, T1:1, ... T14:1, T15:1, T0:9, T1:9, ... T14:9, T15:9]
...
[ T0:7, T1:7, ... T14:7, T15:7, T0:15, T1:15, ... T14:15, T15:15]
[ T16:0, T17:0, ... T30:0, T31:0, T16:8, T17:8, ... T30:8, T31:8]
...
[ T16:7, T17:7, ... T30:7, T31:7, T16:15, T17:15, ... T30:15, T31:15]
[ T32:0, T33:0, ... T46:0, T47:0, T32:8, T33:8, ... T46:8, T47:8]
...
[ T48:0, T49:0, ... T62:0, T63:0, T48:8, T49:8, ... T62:8, T63:8]
...
Output tensor layout:
#ttg.dot_op<{opIdx = 0, parent = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 4, threadsPerWarp = 16, warpsPerCTA = [4, 1], repCluster = [1, 1], A = [8, 32], B = [32, 16], C = [8, 16]}>, kWidth = 4}>
[ T0:0, T0:1, ... T15:0, T15:1, T0:16, T0:17, ... T15:16, T15:17]
[ T0:2, T0:3, ... T15:2, T15:3, T0:18, T0:19, ... T15:18, T15:19]
...
[ T0:14, T0:15, ... T15:14, T15:15, T0:30, T0:31, ... T15:30, T15:31]
[ T16:0, T16:1, ... T31:0, T31:1, T16:16, T16:17, ... T31:16, T31:17]
...
[ T16:14, T16:15, ... T31:14, T31:15, T16:30, T16:31, ... T31:30, T31:31]
[ T32:0, T32:1, ... T47:0, T47:1, T32:16, T32:17, ... T47:16, T47:17]
...
[ T48:0, T48:1, ... T63:0, T63:1, T48:16, T48:17, ... T63:16, T63:17]
...
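As a concrete illustration of the packing, here is a standalone Python sketch (not code from this PR; the low-nibble-first ordering and the example byte values are assumptions made for the illustration):

```python
# e2m1 (fp4): 1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1.
# The eight non-negative representable magnitudes, indexed by the low 3 bits:
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Decode a single 4-bit e2m1 value."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]

def unpack_i8(byte: int) -> tuple:
    """One packed int8 holds two fp4 elements (low nibble assumed first)."""
    return decode_e2m1(byte & 0x0F), decode_e2m1((byte >> 4) & 0x0F)

# A 32-element i8 row expands to a 64-element bf16 row, which is why every
# thread must cover twice as many contiguous output columns after the upcast.
row_i8 = [0x21, 0x7F, 0x00, 0x88] * 8            # 32 packed example bytes
row_bf16 = [v for b in row_i8 for v in unpack_i8(b)]
assert len(row_bf16) == 2 * len(row_i8)
```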
// CHECK: [[CVT_ARG0:%.*]] = ttg.convert_layout %arg0 : tensor<128x32xi8, [[BLOCKED]]> -> tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[CVT_ARG1:%.*]] = ttg.convert_layout %arg1 : tensor<128x2xi8, [[BLOCKED1]]> -> tensor<128x2xi8, [[BLOCKED3]]>
// CHECK: [[UPCAST:%.*]] = ttg.upcast_mxfp [[CVT_ARG0]], [[CVT_ARG1]] fp_type = e2m1 : tensor<128x32xi8, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, tensor<128x2xi8, [[BLOCKED3]]> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>>
// CHECK: [[A:%.*]] = ttg.convert_layout [[UPCAST]] : tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS1]], kWidth = 4}>> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>
There is still a ttg.convert_layout? Can we directly output the result type of ttg.upcast_mxfp as tensor<128x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>>, so that we can omit the convert layout operation?
No, please see my explanation above.
We have to use i8 with OpsPerChan=2 and unpack to bf16 with OpsPerChan=4, then use ttg.convert_layout to convert the bf16 tensor dot layout (OpsPerChan=4) back to the operand A dot layout (OpsPerChan=2). Otherwise, the two unpacked fp4 values would be placed at incorrect positions.
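To illustrate the point, here is a small Python check derived from row 0 of the layout diagrams above (first warp only; the helper names are made up): with OpsPerChan=4, the bf16 values each thread produces from its own i8 columns land exactly in the columns that the output layout assigns to that thread, so the upcast itself needs no cross-thread data exchange.

```python
def i8_columns(t: int) -> set:
    # Input dot layout (OpsPerChan=2), row 0: [T0:0 .. T15:0, T0:8 .. T15:8]
    return {t, 16 + t}

def bf16_columns(t: int) -> set:
    # Output dot layout (OpsPerChan=4), row 0:
    # [T0:0, T0:1 .. T15:0, T15:1, T0:16, T0:17 .. T15:16, T15:17]
    return {2 * t, 2 * t + 1, 2 * t + 32, 2 * t + 33}

for t in range(16):
    # Unpacking the i8 at column c yields the bf16 values for columns 2c and 2c+1.
    produced = {2 * c + i for c in i8_columns(t) for i in (0, 1)}
    assert produced == bf16_columns(t)

# The trailing ttg.convert_layout is what redistributes these bf16 values into
# the OpsPerChan=2 operand-A layout that tt.dot expects.
```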
Would these conversions be preserved after running -tritonintelgpu-remove-layout-conversions?
A possible way to shuffle the values from #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 4}> to #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}> without SLM:
- bitcast <2xbfloat16> to uint32.
- use inline vISA to change the uint32 back to <2xbfloat16> without register movement.
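The second step needs inline vISA, but the bitcast half of the idea can be sketched in plain numpy (uint16 stands in for bfloat16, since numpy has no bf16 dtype). This only illustrates the reinterpretation, not the proposed codegen:

```python
import numpy as np

# Viewing a pair of 16-bit lanes as one uint32 and back is a pure
# reinterpretation of the same bytes, with no value movement.
pair = np.array([0x3F80, 0x4000], dtype=np.uint16)  # bf16 bit patterns for 1.0 and 2.0
packed = pair.view(np.uint32)                       # "bitcast <2 x bf16> to uint32"
unpacked = packed.view(np.uint16)                   # reinterpret back to two 16-bit lanes
assert np.array_equal(unpacked, pair)
```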
> Would these conversions be preserved after running -tritonintelgpu-remove-layout-conversions?

Yes. Before applying the solution chengjun mentioned, I think this ttg.convert_layout is necessary to make the result correct.
IMO we can keep the convert layout in this PR, and further improve the codegen to avoid SLM (using inline vISA, a new builtin, or some other approach) in another PR.
// CHECK: [[B:%.*]] = tt.fp_to_fp [[CVT_ARG0]] : tensor<64x128xf8E4M3FN, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>>
// CHECK: [[D:%.*]] = tt.dot [[A]], [[B]], [[C]] : tensor<32x64xbf16, #ttg.dot_op<{opIdx = 0, parent = [[DPAS]], kWidth = 2}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = [[DPAS]], kWidth = 2}>> -> tensor<32x128xf32, [[DPAS]]>
// CHECK: [[RES:%.*]] = ttg.convert_layout [[D]] : tensor<32x128xf32, [[DPAS]]> -> tensor<32x128xf32, [[BLOCKED4]]>
// CHECK: scf.yield [[RES]] : tensor<32x128xf32, [[BLOCKED4]]>
Here we transpose the dot_scaled operands before lowering, to convert an RHS UpcastMXFP into an LHS UpcastMXFP instead of implementing RHS UpcastMXFP directly.
In the case of RHS scaling with a dot layout, each thread accesses elements along a column, which would require scale values from threads in other warps; however, we can only shuffle values between threads within the same warp. So I kept the same logic as upstream.
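For reference, the algebra behind the transpose trick can be checked with a minimal numpy sketch (an illustration only; the shapes are arbitrary and only the 32-element scale group along K follows the mxfp convention): scaling the RHS and multiplying gives the same result as transposing both operands, scaling what is now the LHS, and transposing the product back.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 32, 64, 128
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
scale = rng.standard_normal((K // 32, N))   # one scale per 32 K-elements per column
S = np.repeat(scale, 32, axis=0)            # expand scales to B's shape

rhs_scaled = A @ (B * S)                    # scaled operand on the RHS
lhs_scaled = ((B.T * S.T) @ A.T).T          # transposed: scaled operand becomes the LHS
assert np.allclose(rhs_scaled, lhs_scaled)
```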
This pull request supports dot layout codegen for the upcast_mxfp operation, which could be more efficient than the previous blocked layout implementation.
The 2 skipped tests fail with an L0 runtime error; they will be addressed in a separate PR #2968.