
[CPU] attn supports f16 #26487

Merged: 30 commits, Oct 15, 2024

Conversation

xczhai (Contributor) commented Sep 9, 2024

Details:

  • rebase from [CPU] SDPA supports f16 #22939
  • enable avx512 fp16 for attention
  • enable amx fp16 for attention
  • update PagedAttentionExtension lightly so that the correct precision can be specified for PA's second output

Tickets:

  • 128183

@xczhai xczhai requested review from a team as code owners September 9, 2024 08:33
@xczhai xczhai marked this pull request as draft September 9, 2024 08:33
@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: CPU OpenVINO CPU plugin labels Sep 9, 2024
@xczhai xczhai force-pushed the xc/avx_fp16 branch 2 times, most recently from f65a8e1 to cc3d9cf Compare September 10, 2024 10:32
@yuxu42 yuxu42 mentioned this pull request Sep 23, 2024
@xczhai xczhai force-pushed the xc/avx_fp16 branch 2 times, most recently from 63700c7 to 667901f Compare September 24, 2024 06:08
@xczhai xczhai marked this pull request as ready for review September 24, 2024 09:41
Review threads (outdated, resolved):
  • src/plugins/intel_cpu/src/nodes/scaled_attn.cpp
  • src/inference/src/system_conf.cpp (two threads)
@github-actions github-actions bot removed the category: inference OpenVINO Runtime library - Inference label Sep 27, 2024
xczhai (Contributor, Author) commented Sep 27, 2024

@luo-cheng2021 I have updated the code according to the comments. Could you help review it, especially the paged attention part? Thanks.

```diff
@@ -193,9 +193,12 @@ void Graph::Replicate(const std::shared_ptr<const ov::Model> &model,
 auto parentNode = op2node[unusedOutput.get_node_shared_ptr()];
 const auto port = unusedOutput.get_index();
 const auto nodeName = std::string("stub_") + std::to_string(unusedOutput.get_index()) + "_" + parentNode->getName();
+// WA: avoid PagedAttention's second output reorder.
```
Contributor:

We'd better not hardcode this here; we should find the place where the output precision is changed to f16.

Contributor:

Any updates here? This has to be resolved before the merge.

Contributor (Author):

> Any updates here? This has to be resolved before the merge.

Still addressing it. The root cause is the ConvertPrecision transformation; it needs to be handled carefully so as not to affect the GPU plugin.

xczhai (Contributor, Author) commented Oct 10, 2024

@dmitry-gorokhov @luo-cheng2021
I removed the WA. The root cause is that the PagedAttentionExtension op is somewhat hardcoded: it provides no mechanism to change the dtype of its 2nd output. I made the following changes (see the sketch below):

  1. Added a set_out_type member function to the PagedAttentionExtension op.
  2. validate_and_infer_types in PagedAttentionExtension now determines the output type from that setting, so the GPU path is not broken.
  3. Added a fuse_type_to_pa callback in the CPU plugin, which extends ConvertPrecision and specifies the correct type for PagedAttentionExtension's 2nd output. Its scope is limited to the CPU plugin, so the common pass is unaffected.
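To make the mechanism concrete, here is a minimal, self-contained C++ sketch of the pattern described above. PagedAttentionExtension, set_out_type, and fuse_type_to_pa are named after the entities in this comment, but this is a toy stand-in under my own assumptions (the m_output_type member, the signatures, and the "undefined means default" rule), not OpenVINO's actual op or pass code:

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// Toy precision enum; stands in for ov::element::Type in this sketch.
enum class ElementType { undefined, f32, f16 };

// Toy stand-in for the real op, illustrating the per-output type override.
class PagedAttentionExtension {
public:
    // (1) Override the dtype of a given output; 'undefined' keeps the default.
    void set_out_type(int index, ElementType t) {
        assert(index >= 0 && index < 2);
        m_output_type[index] = t;
    }

    // (2) The resolution rule validate_and_infer_types() would apply: without
    // an override, fall back to the default, so callers that never invoke
    // set_out_type (e.g. the GPU path) see no behavior change.
    ElementType resolve_out_type(int index, ElementType default_type) const {
        assert(index >= 0 && index < 2);
        return m_output_type[index] == ElementType::undefined ? default_type
                                                              : m_output_type[index];
    }

private:
    std::array<ElementType, 2> m_output_type{ElementType::undefined, ElementType::undefined};
};

// (3) The CPU-plugin hook extending ConvertPrecision: pin output #1 to f32,
// the precision the CPU kernels actually produce.
bool fuse_type_to_pa(PagedAttentionExtension& pa) {
    pa.set_out_type(1, ElementType::f32);
    return true;
}

int main() {
    PagedAttentionExtension pa;
    fuse_type_to_pa(pa);
    // After ConvertPrecision targets f16: output #0 follows f16, output #1 stays f32.
    std::printf("out0 is f16: %d\n", pa.resolve_out_type(0, ElementType::f16) == ElementType::f16);
    std::printf("out1 is f32: %d\n", pa.resolve_out_type(1, ElementType::f16) == ElementType::f32);
}
```

The design point is that the override defaults to undefined, so graph paths that never call set_out_type keep the previous behavior.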

Contributor:

@xczhai Just for my better understanding: could you please describe the pattern where the Reorder is inserted? E.g., does the next op after PA expect fp16 on its input?

Contributor (Author):

> @xczhai Just for my better understanding: could you please describe the pattern where the Reorder is inserted? E.g., does the next op after PA expect fp16 on its input?

Okay:

  1. At the very beginning, the PA op spec says that both output types are aligned with the input0 type. As a result, PA's output types are f32 when the model enters the CPU plugin.
  2. During the CPU plugin transformations, ConvertPrecision converts or fuses the ops' types, so both of PA's outputs become f16. But remember that PA's 2nd output dangles, with no child and no Result node.
  3. During graph construction, every dangling output is wrapped by a Result node whose type is aligned with the output type. Here the specific pattern is: PA's 2nd output --> Result(f16).
  4. But by CPU node design, PA's 2nd output is always f32. So the pattern is: PA's 2nd output(f32) --> Result(f16).
  5. The subsequent ResolveConflict logic detects this pattern and inserts a Reorder, so it becomes: PA's 2nd output(f32) --> Reorder --> Result(f16). A sketch of this rule follows.
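For illustration, a minimal sketch of the conflict rule in step 5, assuming it reduces to comparing producer and consumer precisions on an edge; resolve_conflict, Edge, and the printed patterns are hypothetical names for exposition, not the intel_cpu plugin's real API:

```cpp
#include <cstdio>
#include <string>

enum class Precision { f32, f16 };

const char* name(Precision p) { return p == Precision::f32 ? "f32" : "f16"; }

struct Edge {
    std::string parent;     // producing node, e.g. "PA.out1"
    Precision parent_prec;  // what the producer actually emits
    std::string child;      // consuming node, e.g. "Result"
    Precision child_prec;   // what the consumer expects
};

// Illustrative ResolveConflict step: when producer and consumer precisions
// disagree on an edge, a Reorder is inserted between them.
void resolve_conflict(const Edge& e) {
    if (e.parent_prec != e.child_prec) {
        std::printf("%s(%s) --> Reorder --> %s(%s)\n",
                    e.parent.c_str(), name(e.parent_prec),
                    e.child.c_str(), name(e.child_prec));
    } else {
        std::printf("%s(%s) --> %s(%s)  [no Reorder]\n",
                    e.parent.c_str(), name(e.parent_prec),
                    e.child.c_str(), name(e.child_prec));
    }
}

int main() {
    // Before the fix: ConvertPrecision marked the wrapping Result f16 while
    // the CPU PA node always produces f32 on its 2nd output -> Reorder appears.
    resolve_conflict({"PA.out1", Precision::f32, "Result", Precision::f16});
    // After the fix: fuse_type_to_pa keeps the 2nd output f32, so the Result
    // is created as f32 and no Reorder is inserted.
    resolve_conflict({"PA.out1", Precision::f32, "Result", Precision::f32});
}
```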

@xczhai xczhai force-pushed the xc/avx_fp16 branch 2 times, most recently from ed88364 to 37479b2 Compare September 29, 2024 07:01
yuxu42 (Contributor) commented Sep 30, 2024

Hi @dmitry-gorokhov, could you please review? Thanks!

dmitry-gorokhov (Contributor) commented:

@xczhai I am going to merge the ARM SDPA FP16 PR first (#26487). Let's rebase this PR after that.

@xczhai xczhai requested a review from a team as a code owner October 10, 2024 06:33
@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Oct 10, 2024
- rebase f16 impl from arm
- refactor the testcase for x64
xczhai (Contributor, Author) commented Oct 12, 2024

> @xczhai I am going to merge the ARM SDPA FP16 PR first (#26487). Let's rebase this PR after that.

Rebased on the ARM changes.

@xczhai xczhai closed this Oct 12, 2024
@xczhai xczhai reopened this Oct 12, 2024

```cpp
// Skip the test when the host CPU lacks native support for the precision under test.
if ((inType == ElementType::bf16 && !ov::with_cpu_x86_bfloat16()) ||
    (inType == ElementType::f16 && !ov::with_cpu_x86_avx512_core_fp16())) {
    GTEST_SKIP();
}
```
Contributor:

The test skip is still there.

Contributor (Author):

> The test skip is still there.

Removed.

@xczhai xczhai force-pushed the xc/avx_fp16 branch 2 times, most recently from 93e1a24 to b9c9f4c Compare October 14, 2024 07:39
@dmitry-gorokhov dmitry-gorokhov added this to the 2024.5 milestone Oct 15, 2024
@dmitry-gorokhov dmitry-gorokhov self-assigned this Oct 15, 2024
@dmitry-gorokhov dmitry-gorokhov added this pull request to the merge queue Oct 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 15, 2024
@dmitry-gorokhov dmitry-gorokhov added this pull request to the merge queue Oct 15, 2024
Merged via the queue into openvinotoolkit:master with commit 9486b7d Oct 15, 2024
160 checks passed
@dmitry-gorokhov dmitry-gorokhov deleted the xc/avx_fp16 branch October 15, 2024 07:33
Labels: category: Core OpenVINO Core (aka ngraph), category: CPU OpenVINO CPU plugin
Projects: none yet
4 participants