
Decompose large simdblockread to smaller simdblockreads #2193

Merged · 9 commits · Sep 13, 2024

Conversation

whitneywhtsang
Contributor

@whitneywhtsang whitneywhtsang commented Sep 11, 2024

A `<64xi16>` `simdblockread` is not supported by the OpenCL C builtins, so decompose it into 8 x `<8xi16>` reads.
Restrict `TritonGEN::SIMDBlockReadOp` to accept only vector types that are allowed by the OpenCL C builtins.
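For illustration only, a sketch of the decomposition idea in Python (not the actual MLIR lowering; `read8` and the names below are hypothetical stand-ins):

```python
# Illustrative sketch, assuming the OpenCL C block-read builtins top out at
# 8-element vectors for 16-bit reads (per the PR description): a <64 x i16>
# simdblockread is split into 8 legal reads of <8 x i16>.

MAX_BUILTIN_LEN = 8  # largest vector length the builtin accepts (assumption)

def decompose_block_read(total_elems, read8):
    """Emit total_elems // 8 block reads of 8 elements and concatenate them.

    `read8(offset)` stands in for one <8 x i16> simdblockread starting at
    the given element offset.
    """
    assert total_elems % MAX_BUILTIN_LEN == 0, "expect a multiple of 8"
    result = []
    for offset in range(0, total_elems, MAX_BUILTIN_LEN):
        result.extend(read8(offset))  # one legal-sized simdblockread
    return result

# A <64 x i16> read becomes 8 reads of <8 x i16>:
memory = list(range(64))   # pretend local-memory contents
offsets = []               # record each emitted read's offset

def read8(offset):
    offsets.append(offset)
    return memory[offset:offset + MAX_BUILTIN_LEN]

values = decompose_block_read(64, read8)
```

The concatenated result is identical to what a single `<64xi16>` read would produce; only the number of emitted operations changes.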

@whitneywhtsang whitneywhtsang requested review from etiotto, quintinwang5 and a team September 11, 2024 05:01
@whitneywhtsang whitneywhtsang self-assigned this Sep 11, 2024
@whitneywhtsang
Contributor Author

whitneywhtsang commented Sep 11, 2024

Before this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10805431155
After this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10805299931
Performance of the SLM path improves with this PR, but it is still slower than I expected. @quintinwang5, what is the expected performance of the SLM path?

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from c58c8b1 to 638de27 Compare September 11, 2024 05:27
@quintinwang5
Contributor

quintinwang5 commented Sep 11, 2024

> Performance of the SLM path improves with this PR, but it is still slower than I expected. @quintinwang5, what is the expected performance of the SLM path?

Local data (SLM path):

attn-performance:
      Z     H    N_CTX  D_HEAD  Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0   1.0  32.0  16384.0    64.0    10.382262   16.149019        10.382262       16.149019        10.382262       16.149019      85.051490    132.292763          85.051490        132.292763          85.051490        132.292763        NaN       NaN
1   2.0  32.0   8192.0    64.0    20.954494   32.180948        20.954494       32.180948        20.954494       32.180948      85.829609    131.813164          85.829609        131.813164          85.829609        131.813164        NaN       NaN
2   4.0  32.0   4096.0    64.0    42.069249   61.615244        42.069249       61.615244        42.069249       61.615244      86.157822    126.188019          86.157822        126.188019          86.157822        126.188019        NaN       NaN
3   4.0  48.0   1024.0    64.0   170.200363  212.119221       150.046645      212.119221       175.054429      212.119221      87.142586    108.605041          76.823882        108.605041          89.627868        108.605041   0.036618       NaN
4   8.0  32.0   2048.0    64.0    88.815334  121.433232        88.352289      121.433232        89.283250      121.433232      90.946902    124.347630          90.472744        124.347630          91.426048        124.347630   0.007412       NaN
5  16.0  32.0   1024.0    64.0   170.390493  228.915493       155.531799      228.915493       174.944903      228.915493      87.239932    117.204732          79.632281        117.204732          89.571790        117.204732   0.002092       NaN
6  32.0  32.0    512.0    64.0   260.522518  394.943885       247.597618      394.943885       268.908737      394.943885      66.693765    101.105635          63.384990        101.105635          68.840637        101.105635   0.011547   

Local data (advanced path):

attn-performance:
      Z     H    N_CTX  D_HEAD  Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0   1.0  32.0  16384.0    64.0    11.204830   16.285712        11.204830       16.285712        11.204830       16.285712      91.789971    133.412550          91.789971        133.412550          91.789971        133.412550        NaN       NaN
1   2.0  32.0   8192.0    64.0    22.544568   31.411536        22.544568       31.411536        22.544568       31.411536      92.342552    128.661653          92.342552        128.661653          92.342552        128.661653        NaN       NaN
2   4.0  32.0   4096.0    64.0    46.296022   61.150378        46.296022       61.150378        46.296022       61.150378      94.814253    125.235975          94.814253        125.235975          94.814253        125.235975        NaN       NaN
3   4.0  48.0   1024.0    64.0   193.930421  215.904466       174.520281      215.904466       198.719406      215.904466      99.292376    110.543087          89.354384        110.543087         101.744336        110.543087   0.029862       NaN
4   8.0  32.0   2048.0    64.0    87.181541  120.266775        83.949038      120.266775        99.085848      120.266775      89.273898    123.153178          85.963815        123.153178         101.463908        123.153178        NaN       NaN
5  16.0  32.0   1024.0    64.0   176.831840  229.793382       166.572830      229.793382       188.001076      229.793382      90.537902    117.654212          85.285289        117.654212          96.256551        117.654212   0.044558       NaN
6  32.0  32.0    512.0    64.0   278.213738  389.986436       252.137290      389.986436       286.007771      389.986436      71.222717     99.836528          64.547146         99.836528          73.217989         99.836528   0.011259       NaN

The SLM path should be about 10 TFLOPS slower than the advanced path.
At a minimum, you need the following environment variables to get relatively good performance on the advanced (and SLM) paths:

export TRITON_INTEL_ENABLE_INSTR_SCHED=1
export TRITON_INTEL_ENABLE_FAST_PREFETCH=1

@Dewei-Wang-sh Dewei-Wang-sh self-requested a review September 11, 2024 06:25

@Dewei-Wang-sh Dewei-Wang-sh left a comment

By default, we do not want to enable first load to SLM.

@Dewei-Wang-sh
Contributor

Please keep me informed about what happened in the advanced path!

@vlad-penkin vlad-penkin linked an issue Sep 11, 2024 that may be closed by this pull request
@whitneywhtsang
Contributor Author

> By default, we do not want to enable first load to SLM.

There is no plan to enable first load to SLM by default. This is a draft PR; two of the commits are used for testing. Since I am changing the lowering for the SLM path, I enabled it to ensure it works as expected in the meantime.

@whitneywhtsang
Contributor Author

> Please keep me informed about what happened in the advanced path!

Of course. This is a draft PR, not ready for review.

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from 638de27 to 39b31a2 Compare September 11, 2024 16:20
Base automatically changed from whitneywhtsang/simdblock to llvm-target September 11, 2024 16:21
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch 2 times, most recently from 28c7187 to 4bb57ac Compare September 11, 2024 20:59
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from 9d91486 to ab3a173 Compare September 11, 2024 23:12
@Dewei-Wang-sh
Contributor

>> Please keep me informed about what happened in the advanced path!
>
> Of course. This is a draft PR, not ready for review.

But you requested others for review...

@whitneywhtsang
Contributor Author

> But you requested others for review...

Mainly I wanted to get early feedback from Victor, as the idea came from him, and from Quintin, as the results of FA with the SLM path didn't look right. Anyway, I will take extra care to include you on advanced-path changes, or remember not to request reviewers on a draft PR.

@Dewei-Wang-sh
Contributor

> Mainly I wanted to get early feedback from Victor, as the idea came from him, and from Quintin, as the results of FA with the SLM path didn't look right.

Thanks. FYI, I'm the one who proposed using SLM and SIMD block load/store during the FA work.

@whitneywhtsang
Contributor Author

> Thanks. FYI, I'm the one who proposed using SLM and SIMD block load/store during the FA work.

Sounds good. Just to clarify: I meant that the idea of this PR, i.e., decomposing simdblockread into sizes supported by the OpenCL C builtins, came from Victor.

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from ab3a173 to ea9edbd Compare September 12, 2024 02:19
@whitneywhtsang whitneywhtsang marked this pull request as ready for review September 12, 2024 02:22
@whitneywhtsang
Contributor Author

Will do a similar change to simdblockwrite if this change looks good to reviewers.

@etiotto
Contributor

etiotto commented Sep 12, 2024

@Dewei-Wang-sh it looks like you requested changes on this PR. Does it now look OK to you as well?

@whitneywhtsang
Contributor Author

> Will do a similar change to simdblockwrite if this change looks good to reviewers.

Done in #2227.


@Dewei-Wang-sh Dewei-Wang-sh left a comment

lgtm

@whitneywhtsang whitneywhtsang enabled auto-merge (squash) September 13, 2024 15:00
@whitneywhtsang whitneywhtsang merged commit acec47a into llvm-target Sep 13, 2024
4 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/splitsimdblock branch September 13, 2024 16:14
etiotto pushed a commit that referenced this pull request Sep 13, 2024
Similar to #2193, but for `simdblockwrite`.
Decompose a `simdblockwrite` of vector size > 8 into a number of
`simdblockwrite`s of vector size 8; e.g., a `<64xi16>` `simdblockwrite` is
not supported by the OpenCL C builtins, so it is decomposed into
8 x `<8xi16>` writes.
Restrict `TritonGEN::SIMDBlockWriteOp` to accept only vector types that
are allowed by the OpenCL C builtins.

---------

Signed-off-by: Whitney Tsang <[email protected]>
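The write-side decomposition mirrors the read side. As an illustration only (not the actual MLIR lowering; `write8` is a hypothetical stand-in for one `<8xi16>` `simdblockwrite`):

```python
# Illustrative sketch, assuming 8 is the largest vector length the OpenCL C
# block-write builtin accepts for 16-bit elements: a <64 x i16>
# simdblockwrite becomes 8 legal writes of <8 x i16>.

def decompose_block_write(values, write8, chunk=8):
    """Split one large block write into len(values) // chunk legal writes."""
    assert len(values) % chunk == 0, "expect a multiple of the builtin size"
    for offset in range(0, len(values), chunk):
        write8(offset, values[offset:offset + chunk])  # one legal-sized write

# A <64 x i16> write becomes 8 writes of <8 x i16>:
slm = [0] * 64  # pretend local-memory destination

def write8(offset, vec):
    slm[offset:offset + 8] = vec

decompose_block_write(list(range(64)), write8)
```

After the decomposed writes, the destination holds exactly what a single `<64xi16>` write would have stored.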

Successfully merging this pull request may close these issues.

Decompose large simdblockread to smaller simdblockreads
5 participants