
Decompose large simdblockread to smaller simdblockreads #2193

Merged · 9 commits · Sep 13, 2024

Conversation

whitneywhtsang
Contributor

@whitneywhtsang whitneywhtsang commented Sep 11, 2024

A `<64xi16>` `simdblockread` is not supported by the OpenCL C builtins, so decompose it into 8 x `<8xi16>` reads.
Restrict `TritonGEN::SIMDBlockReadOp` to accept only vector types that are allowed by the OpenCL C builtins.
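For illustration only, a sketch of the decomposition idea in Python (not the actual MLIR lowering; `read8` and the names below are hypothetical stand-ins):

```python
# Illustrative sketch, assuming the OpenCL C block-read builtins top out at
# 8-element vectors for 16-bit reads (per the PR description): a <64 x i16>
# simdblockread is split into 8 legal reads of <8 x i16>.

MAX_BUILTIN_LEN = 8  # largest vector length the builtin accepts (assumption)

def decompose_block_read(total_elems, read8):
    """Emit total_elems // 8 block reads of 8 elements and concatenate them.

    `read8(offset)` stands in for one <8 x i16> simdblockread starting at
    the given element offset.
    """
    assert total_elems % MAX_BUILTIN_LEN == 0, "expect a multiple of 8"
    result = []
    for offset in range(0, total_elems, MAX_BUILTIN_LEN):
        result.extend(read8(offset))  # one legal-sized simdblockread
    return result

# A <64 x i16> read becomes 8 reads of <8 x i16>:
memory = list(range(64))   # pretend local-memory contents
offsets = []               # record each emitted read's offset

def read8(offset):
    offsets.append(offset)
    return memory[offset:offset + MAX_BUILTIN_LEN]

values = decompose_block_read(64, read8)
```

The concatenated result is identical to what a single `<64xi16>` read would produce; only the number of emitted operations changes.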

@whitneywhtsang whitneywhtsang requested review from etiotto, quintinwang5 and a team September 11, 2024 05:01
@whitneywhtsang whitneywhtsang self-assigned this Sep 11, 2024
@whitneywhtsang
Contributor Author

whitneywhtsang commented Sep 11, 2024

Before this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10805431155
After this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10805299931
Performance of the SLM path improves with this PR, but it is still slower than I expected. @quintinwang5, what is the expected performance of the SLM path?

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from c58c8b1 to 638de27 Compare September 11, 2024 05:27
@quintinwang5
Contributor

quintinwang5 commented Sep 11, 2024

> Performance of the SLM path improves with this PR, but it is still slower than I expected. @quintinwang5, what is the expected performance of the SLM path?

Local data (SLM path):

attn-performance:
      Z     H    N_CTX  D_HEAD  Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0   1.0  32.0  16384.0    64.0    10.382262   16.149019        10.382262       16.149019        10.382262       16.149019      85.051490    132.292763          85.051490        132.292763          85.051490        132.292763        NaN       NaN
1   2.0  32.0   8192.0    64.0    20.954494   32.180948        20.954494       32.180948        20.954494       32.180948      85.829609    131.813164          85.829609        131.813164          85.829609        131.813164        NaN       NaN
2   4.0  32.0   4096.0    64.0    42.069249   61.615244        42.069249       61.615244        42.069249       61.615244      86.157822    126.188019          86.157822        126.188019          86.157822        126.188019        NaN       NaN
3   4.0  48.0   1024.0    64.0   170.200363  212.119221       150.046645      212.119221       175.054429      212.119221      87.142586    108.605041          76.823882        108.605041          89.627868        108.605041   0.036618       NaN
4   8.0  32.0   2048.0    64.0    88.815334  121.433232        88.352289      121.433232        89.283250      121.433232      90.946902    124.347630          90.472744        124.347630          91.426048        124.347630   0.007412       NaN
5  16.0  32.0   1024.0    64.0   170.390493  228.915493       155.531799      228.915493       174.944903      228.915493      87.239932    117.204732          79.632281        117.204732          89.571790        117.204732   0.002092       NaN
6  32.0  32.0    512.0    64.0   260.522518  394.943885       247.597618      394.943885       268.908737      394.943885      66.693765    101.105635          63.384990        101.105635          68.840637        101.105635   0.011547   

Local data (advanced path):

attn-performance:
      Z     H    N_CTX  D_HEAD  Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0   1.0  32.0  16384.0    64.0    11.204830   16.285712        11.204830       16.285712        11.204830       16.285712      91.789971    133.412550          91.789971        133.412550          91.789971        133.412550        NaN       NaN
1   2.0  32.0   8192.0    64.0    22.544568   31.411536        22.544568       31.411536        22.544568       31.411536      92.342552    128.661653          92.342552        128.661653          92.342552        128.661653        NaN       NaN
2   4.0  32.0   4096.0    64.0    46.296022   61.150378        46.296022       61.150378        46.296022       61.150378      94.814253    125.235975          94.814253        125.235975          94.814253        125.235975        NaN       NaN
3   4.0  48.0   1024.0    64.0   193.930421  215.904466       174.520281      215.904466       198.719406      215.904466      99.292376    110.543087          89.354384        110.543087         101.744336        110.543087   0.029862       NaN
4   8.0  32.0   2048.0    64.0    87.181541  120.266775        83.949038      120.266775        99.085848      120.266775      89.273898    123.153178          85.963815        123.153178         101.463908        123.153178        NaN       NaN
5  16.0  32.0   1024.0    64.0   176.831840  229.793382       166.572830      229.793382       188.001076      229.793382      90.537902    117.654212          85.285289        117.654212          96.256551        117.654212   0.044558       NaN
6  32.0  32.0    512.0    64.0   278.213738  389.986436       252.137290      389.986436       286.007771      389.986436      71.222717     99.836528          64.547146         99.836528          73.217989         99.836528   0.011259       NaN

The SLM path should be about 10 TFLOPS slower than the advanced path.
At a minimum, you need the following environment variables to get relatively good performance on the advanced (and SLM) paths:

export TRITON_INTEL_ENABLE_INSTR_SCHED=1
export TRITON_INTEL_ENABLE_FAST_PREFETCH=1

@Dewei-Wang-sh Dewei-Wang-sh self-requested a review September 11, 2024 06:25

@Dewei-Wang-sh Dewei-Wang-sh left a comment

By default, we do not want to enable first load to SLM.

@Dewei-Wang-sh
Contributor

Please keep me informed about what happened in the advanced path!

@vlad-penkin vlad-penkin linked an issue Sep 11, 2024 that may be closed by this pull request
@whitneywhtsang
Contributor Author

> By default, we do not want to enable first load to SLM.

There is no plan to enable first load to SLM by default. This is a draft PR; two of the commits are used for testing. Since I am changing the lowering for the SLM path, I enabled it to ensure it works as expected in the meantime.

@whitneywhtsang
Contributor Author

> Please keep me informed about what happened in the advanced path!

Of course. This is a draft PR, not ready for review.

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from 638de27 to 39b31a2 Compare September 11, 2024 16:20
Base automatically changed from whitneywhtsang/simdblock to llvm-target September 11, 2024 16:21
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch 2 times, most recently from 28c7187 to 4bb57ac Compare September 11, 2024 20:59
@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from 9d91486 to ab3a173 Compare September 11, 2024 23:12
@Dewei-Wang-sh
Contributor

>> Please keep me informed about what happened in the advanced path!
>
> Of course. This is a draft PR, not ready for review.

But you requested others for review...

@whitneywhtsang
Contributor Author

> But you requested others for review...

Mainly I wanted to get early feedback from Victor, as the idea came from him, and from Quintin, as the results of FA with the SLM path didn't look right. Anyway, I will take extra care to include you on advanced-path changes, or remember not to request reviewers on a draft PR.

@Dewei-Wang-sh
Contributor

> Mainly I wanted to get early feedback from Victor, as the idea came from him, and from Quintin, as the results of FA with the SLM path didn't look right.

Thanks. FYI, I'm the one who proposed using SLM and SIMD block load/store during the FA work.

@whitneywhtsang
Contributor Author

> Thanks. FYI, I'm the one who proposed using SLM and SIMD block load/store during the FA work.

Sounds good. Just to clarify: I meant that the idea of this PR, i.e., decomposing simdblockread into sizes supported by the OpenCL C builtins, came from Victor.

@whitneywhtsang whitneywhtsang force-pushed the whitneywhtsang/splitsimdblock branch from ab3a173 to ea9edbd Compare September 12, 2024 02:19
@whitneywhtsang whitneywhtsang marked this pull request as ready for review September 12, 2024 02:22
@whitneywhtsang
Contributor Author

Will do a similar change to simdblockwrite if this change looks good to reviewers.

@etiotto
Contributor

etiotto commented Sep 12, 2024

@Dewei-Wang-sh it looks like you requested changes on this PR. Does it now look OK to you as well?

@whitneywhtsang
Contributor Author

> Will do a similar change to simdblockwrite if this change looks good to reviewers.

Done in #2227.


@Dewei-Wang-sh Dewei-Wang-sh left a comment

lgtm

@whitneywhtsang whitneywhtsang enabled auto-merge (squash) September 13, 2024 15:00
@whitneywhtsang whitneywhtsang merged commit acec47a into llvm-target Sep 13, 2024
4 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/splitsimdblock branch September 13, 2024 16:14
etiotto pushed a commit that referenced this pull request Sep 13, 2024
Similar to #2193, but for `simdblockwrite`.
Decompose a `simdblockwrite` of vector size > 8 into a number of
`simdblockwrite`s of vector size 8; e.g., a `<64xi16>` `simdblockwrite` is
not supported by the OpenCL C builtins, so it is decomposed into
8 x `<8xi16>` writes.
Restrict `TritonGEN::SIMDBlockWriteOp` to accept only vector types that
are allowed by the OpenCL C builtins.

---------

Signed-off-by: Whitney Tsang <[email protected]>
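The write-side decomposition mirrors the read side. As an illustration only (not the actual MLIR lowering; `write8` is a hypothetical stand-in for one `<8xi16>` `simdblockwrite`):

```python
# Illustrative sketch, assuming 8 is the largest vector length the OpenCL C
# block-write builtin accepts for 16-bit elements: a <64 x i16>
# simdblockwrite becomes 8 legal writes of <8 x i16>.

def decompose_block_write(values, write8, chunk=8):
    """Split one large block write into len(values) // chunk legal writes."""
    assert len(values) % chunk == 0, "expect a multiple of the builtin size"
    for offset in range(0, len(values), chunk):
        write8(offset, values[offset:offset + chunk])  # one legal-sized write

# A <64 x i16> write becomes 8 writes of <8 x i16>:
slm = [0] * 64  # pretend local-memory destination

def write8(offset, vec):
    slm[offset:offset + 8] = vec

decompose_block_write(list(range(64)), write8)
```

After the decomposed writes, the destination holds exactly what a single `<64xi16>` write would have stored.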

Successfully merging this pull request may close these issues.

Decompose large simdblockread to smaller simdblockreads
5 participants