[Snippets] Move BrgemmCopyB repacking logic outside the Subgraph #27007
Conversation
@a-sidorova @IvanNovoselov could you please review the PR? Thanks
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/move_brgemm_repacking_out.cpp
src/plugins/intel_cpu/src/emitters/snippets/external_repacking_adjuster.hpp
src/plugins/intel_cpu/src/emitters/snippets/external_repacking_adjuster.cpp
.../intel_cpu/src/transformations/snippets/x64/pass/lowered/adjust_brgemm_copy_b_loop_ports.cpp
src/common/snippets/include/snippets/lowered/pass/serialize_control_flow.hpp
src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/move_brgemm_repacking_out.hpp
src/common/snippets/include/snippets/mha_parallel_wa_optimizer.hpp
@@ -69,6 +67,9 @@ void CPURuntimeConfigurator::update(const ov::snippets::lowered::LinearIRCPtr& l
     if (linear_ir->is_dynamic()) {
         update_loop_args(linear_ir);
     }
+    update_data_offsets();
+    m_final_runtime_optimizers.run(*linear_ir);
+    m_config->m_latest_shapes = std::move(m_config->shapes);
Do you think `std::move` will work here? What will `m_config->shapes` be after this command? What if we need to access it from any other method?
I use an assumption that this always performs as the last step of `update`. Probably, we can even move this logic outside the `update` method. BTW, the initial config update section can be extracted too:

m_config->master_shape = linear_ir->get_master_shape();
m_config->io_shapes = extract_shapes();
m_config->io_layouts = extract_layouts();

What do you think?
I think we can keep the `master_shape`, `io_shapes` and `io_layouts` updates inside `update()`, since it aligns with `initialization()`, for example.
My concern here is that `update` returns a config in an invalid state, i.e. with stolen `io_shapes`. So I propose to move the `latest_shapes` initialization (and the `io_shapes` corruption) to the latest possible moment. In this case, it's just before the return from `get_updated_config`.
Good idea, I moved the `m_config->shapes` invalidation to `get_updated_config`. Also, after the runtime optimizers pipeline was implemented, it became possible to reuse `RuntimeConfigurator::update` in the CPU configurator, so I did that to avoid code duplication.
if (m_kernel < min_kernel_m)
    break;
Why do we iterate up to `m_dim` if we know that we'll `break` here for sufficiently big divisors? Shouldn't we iterate up to something like `m_dim / min_kernel` and remove this condition?
You are right, but please let me address this later: after perf validation (WIP), I will probably have to significantly change this code anyway
Shouldn't we iterate up to smth like m_dim/min_kernel and remove this condition?

BTW I think so 🤔 If `m_dim % min_kernel != 0`, just find the nearest integer quotient of the division.
The SplitM heuristic has changed after perf validation, please take a look.
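The bound the reviewers converge on can be sketched like this (a hypothetical free function, not the actual OpenVINO implementation): iterating batch candidates only up to `m_dim / min_kernel_m` guarantees that every accepted split has `kernel_m >= min_kernel_m`, which makes the `m_kernel < min_kernel_m` break redundant.

```cpp
#include <cstddef>
#include <utility>

// Returns {batch_dim, kernel_m} such that batch_dim * kernel_m == m_dim,
// preferring the largest batch_dim whose kernel is still >= min_kernel_m.
std::pair<std::size_t, std::size_t> split_m(std::size_t m_dim, std::size_t min_kernel_m) {
    // Largest batch_dim that still leaves a kernel of at least min_kernel_m rows,
    // so the search space is bounded and no break inside the loop is needed.
    const std::size_t max_batch = m_dim / min_kernel_m;
    for (std::size_t batch = max_batch; batch >= 2; --batch) {
        if (m_dim % batch == 0)
            return {batch, m_dim / batch};  // kernel_m = m_dim / batch >= min_kernel_m
    }
    return {1, m_dim};  // no suitable split found: run as a single kernel
}
```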
@a-sidorova @IvanNovoselov the PR is ready for the 2nd review
src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp
splited.second = divisor_1;
break;
// TODO: should we limit minimal kernel_m?
const size_t min_kernel_m = 4;
I'm not sure, but I believe that `min_kernel_m` should be at least 32-64. For me, `4` is too low a value. However, let's wait for the perf validation 😃
Please take a look at the updated heuristic
Great job 👍🏼
// Ideal case #2: M is divisible by optimal parallel work amount, and the new_m_dim is big enough
// In this case, each thread will execute the Snippets kernel 'batch_dim' times
if (m_dim % optimal_parallelism_work_amount == 0) {
    const auto new_m_dim = m_dim / optimal_parallelism_work_amount;
    const size_t min_kernel_m = 64;
    if (new_m_dim >= min_kernel_m) {
        splited.first = optimal_parallelism_work_amount;
        splited.second = new_m_dim;
        OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
        return splited;
    }
}
I think we need to return to this algorithm soon, because I'm still not sure that the current implementation covers our needs.
Just imagine that there is shape [1,5,16384,64] (SD) and optimal_parallelism_work_amount = 18 (our workstation).
Ideal case #1 will be skipped because 18 / 5 = 3.6 is not an integer value.
Ideal case #2 will be skipped too because 16384 / 18 = 910.(2) is not an integer value.
We will go to the next step and get non-optimal scheduling. I'd expect that in this case new_m_dim should be 32 or 64 (small) and new_batch_dim = 512 or 256. Yes, not all threads will process the same count of kernels, but there will be so many kernels that we shouldn't notice this inequality. This is my opinion and I'm not 100% sure either. But I just changed the thread count for SD, and Ideal Case #2 is already broken.
Agree, I reflected this point in ticket 157339.
Details:
Currently, CopyB repacking is always performed inside the Subgraph. When the batch on the B MatMul input is significantly smaller than the batch on the A MatMul input, and the parallel work amount is big enough, this may lead to inefficient execution, since repacking of the B input is performed in each parallel task, whereas one repacking iteration per B batch is enough.
Within this PR, CopyB repacking is moved outside the snippets kernel and performed via the common reorder primitive just before the snippets kernel execution.
Tickets:
- CVS-154383