
[Snippets] Move BrgemmCopyB repacking logic outside the Subgraph #27007

Merged

Conversation


@v-Golubev v-Golubev commented Oct 11, 2024

Details:

Currently, CopyB repacking is always performed inside the Subgraph. When the batch on the B MatMul input is significantly smaller than the batch on the A MatMul input and the parallel work amount is large enough, this may lead to inefficient execution, since the B input is repacked in every parallel task, whereas a single repacking per B batch would be sufficient.

Within this PR, CopyB repacking is moved outside the snippets kernel and is performed via a common reorder primitive just before the snippets kernel execution.
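To illustrate the execution-flow change, here is a minimal sketch. It is not code from the PR: repack_b, run_snippets_kernel, and the sequential parallel_for are hypothetical stand-ins, and the real change uses the plugin's common reorder primitive and the snippets infrastructure.

    #include <cstddef>
    #include <functional>

    // Hypothetical stand-ins: only the call pattern matters here.
    void repack_b(std::size_t b_batch_idx) { /* CopyB repacking of one B batch */ }
    void run_snippets_kernel(std::size_t task, std::size_t b_batch_idx) { /* MatMul body */ }
    void parallel_for(std::size_t n, const std::function<void(std::size_t)>& body) {
        for (std::size_t i = 0; i < n; ++i) body(i);  // sequential stand-in for the thread pool
    }

    // Before: every parallel task repacks its B batch, so the same B batch may be
    // repacked many times when the parallel work amount is large.
    void execute_old(std::size_t work_amount, std::size_t b_batch) {
        parallel_for(work_amount, [&](std::size_t task) {
            repack_b(task % b_batch);                 // redundant work per task
            run_snippets_kernel(task, task % b_batch);
        });
    }

    // After this PR: B is repacked once per B batch (via a common reorder primitive)
    // before the snippets kernel runs, and the kernel only consumes repacked data.
    void execute_new(std::size_t work_amount, std::size_t b_batch) {
        for (std::size_t b = 0; b < b_batch; ++b)
            repack_b(b);                              // exactly one repacking per B batch
        parallel_for(work_amount, [&](std::size_t task) {
            run_snippets_kernel(task, task % b_batch);
        });
    }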

Tickets:

- *CVS-154383*

@github-actions github-actions bot added category: IE Tests OpenVINO Test: plugins and common category: CPU OpenVINO CPU plugin labels Oct 11, 2024
@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch 3 times, most recently from 9d0332e to f45c4fa Compare October 18, 2024 16:16
@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch 2 times, most recently from 18376d1 to fbc7368 Compare November 8, 2024 13:25
@github-actions github-actions bot removed the category: IE Tests OpenVINO Test: plugins and common label Nov 10, 2024
@v-Golubev v-Golubev marked this pull request as ready for review November 12, 2024 10:26
@v-Golubev v-Golubev requested review from a team as code owners November 12, 2024 10:26
@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch from 0eae725 to bcdb12e Compare November 12, 2024 20:42
@v-Golubev (Contributor Author)

@a-sidorova @IvanNovoselov could you please review the PR? Thanks

@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch 4 times, most recently from 1b22709 to fb62330 Compare November 19, 2024 08:48
src/common/snippets/include/snippets/pass/tokenization.hpp (review comments resolved)
src/common/snippets/include/snippets/runtime_optimizer.hpp (review comments resolved)
src/common/snippets/src/mha_parallel_wa_optimizer.cpp (review comments resolved)
@@ -69,6 +67,9 @@ void CPURuntimeConfigurator::update(const ov::snippets::lowered::LinearIRCPtr& l
    if (linear_ir->is_dynamic()) {
        update_loop_args(linear_ir);
    }
    update_data_offsets();
    m_final_runtime_optimizers.run(*linear_ir);
    m_config->m_latest_shapes = std::move(m_config->shapes);
Contributor:

Do you think std::move will work here? What will m_config->shapes be after this command?
What if we need to access it from any other method?

Contributor Author:

I rely on the assumption that this is always performed as the last step of update. We could probably even move this logic outside the update method. BTW, the initial config update section can be extracted too:

    m_config->master_shape = linear_ir->get_master_shape();
    m_config->io_shapes = extract_shapes();
    m_config->io_layouts = extract_layouts();

What do you think?

Contributor:

I think we can keep the master_shape / io_shapes / io_layouts update inside update(), since this aligns with initialization(), for example.
My concern is that update() returns a config in an invalid state, i.e. with stolen io_shapes. So I propose moving the latest_shapes initialization (and the io_shapes invalidation) to the latest possible moment, which in this case is just before the return from get_updated_config.

Contributor Author:

Good idea, I moved the m_config->shapes invalidation to get_updated_config. Also, after the runtime_optimizers pipeline was implemented, it became possible to reuse RuntimeConfigurator::update in the CPU configurator, so I did that to avoid code duplication.
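A simplified sketch of the ordering agreed on above, assuming a heavily trimmed-down config (the real RuntimeConfigurator and RuntimeConfig hold much more state); it only shows where the std::move happens and why the config stays valid until the very end:

    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    using VectorDims = std::vector<std::size_t>;

    struct RuntimeConfig {
        std::vector<VectorDims> io_shapes;      // filled on every update
        std::vector<VectorDims> latest_shapes;  // shapes from the previous iteration
    };

    struct Configurator {
        std::shared_ptr<RuntimeConfig> m_config = std::make_shared<RuntimeConfig>();

        void update(const std::vector<VectorDims>& new_shapes) {
            // update() only fills the config (shapes, offsets, loop args, optimizers, ...)
            // and never steals io_shapes, so the config stays valid for other methods.
            m_config->io_shapes = new_shapes;
        }

        std::shared_ptr<RuntimeConfig> get_updated_config(const std::vector<VectorDims>& new_shapes) {
            update(new_shapes);
            // Invalidation happens at the last possible moment, right before returning:
            // io_shapes is moved into latest_shapes and is left in a valid but
            // unspecified (in practice empty) state until the next update() call.
            m_config->latest_shapes = std::move(m_config->io_shapes);
            return m_config;
        }
    };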

Comment on lines 44 to 45
if (m_kernel < min_kernel_m)
    break;
Contributor:

Why do we iterate up to m_dim if we know that we'll break here for sufficiently big divisors?
Shouldn't we iterate only up to something like m_dim / min_kernel and remove this condition?

Contributor Author:

You are right, but please let me address this later: after perf validation (WIP), I will probably have to significantly change this code anyway

Contributor:

Shouldn't we iterate only up to something like m_dim / min_kernel and remove this condition?

BTW I think so 🤔 If m_dim % min_kernel != 0, just take the nearest integer quotient of the division.
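A small sketch of the bounded search suggested here (not the PR's actual heuristic; split_m and the descending divisor order are assumptions): capping the loop at m_dim / min_kernel_m guarantees that the resulting kernel M never drops below min_kernel_m, so the early break becomes unnecessary.

    #include <cstddef>
    #include <utility>

    // Returns {batch_split, kernel_m}, or {1, m_dim} if no suitable divisor exists.
    std::pair<std::size_t, std::size_t> split_m(std::size_t m_dim, std::size_t min_kernel_m) {
        // The largest divisor worth trying is the nearest integer quotient m_dim / min_kernel_m:
        // any bigger divisor would give kernel_m < min_kernel_m.
        const std::size_t max_divisor = m_dim / min_kernel_m;
        for (std::size_t divisor = max_divisor; divisor > 1; --divisor) {
            if (m_dim % divisor == 0)
                return {divisor, m_dim / divisor};  // kernel_m >= min_kernel_m by construction
        }
        return {1, m_dim};
    }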

Contributor Author:

The SplitM heuristic has changed after perf validation, please take a look.

@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch from 64c9426 to 68b8b06 Compare November 20, 2024 21:33
@v-Golubev (Contributor Author)

@a-sidorova @IvanNovoselov the PR is ready for the 2nd review

splited.second = divisor_1;
break;
// TODO: should we limit minimal kernel_m?
const size_t min_kernel_m = 4;
Contributor:

I'm not sure, but I believe that min_kernel_m should be at least 32-64. For me, 4 is too low a value. However, let's wait for the perf validation 😃

Contributor Author:

Please take a look at the updated heuristic


src/plugins/intel_cpu/src/nodes/subgraph.cpp (review comments resolved)

@v-Golubev v-Golubev force-pushed the vg/snippets/copy_b_repacking_outside branch 2 times, most recently from f29c19d to 6df0b31 Compare November 21, 2024 22:32
@a-sidorova (Contributor) left a comment

Great job 👍🏼

Comment on lines +47 to +59
// Ideal case #2: M is divisible by optimal parallel work amount, and the new_m_dim is big enough
// In this case, each thread will execute the Snippets kernel 'batch_dim' times
if (m_dim % optimal_parallelism_work_amount == 0) {
    const auto new_m_dim = m_dim / optimal_parallelism_work_amount;
    const size_t min_kernel_m = 64;
    if (new_m_dim >= min_kernel_m) {
        splited.first = optimal_parallelism_work_amount;
        splited.second = new_m_dim;
        OPENVINO_ASSERT(splited.first * splited.second == m_dim, "Incorrect dimension M splitting!");
        return splited;
    }
}

Contributor:

I think we need to return to this algorithm soon, because I'm still not sure that the current implementation covers our needs.
Just imagine that there is a shape [1,5,16384,64] (SD) and optimal_parallelism_work_amount = 18 (our workstation).
Ideal case #1 will be skipped because 18 / 5 = 3.6 is not an integer value.
Ideal case #2 will be skipped too because 16384 / 18 = 910.(2) is not an integer value.
We will go to the next step and get non-optimal scheduling. I'd expect that in this case new_m_dim would be 32 or 64 (small) and new_batch_dim = 512 or 256. Yes, not all threads would process the same count of kernels, but there would be so many kernels that we shouldn't notice this inequality. This is my opinion and I'm not 100% sure either, but I just changed the thread count for SD and Ideal Case #2 is broken now.
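To make the imbalance argument concrete, a back-of-the-envelope check for the quoted case (purely illustrative arithmetic, not code from the PR):

    #include <cstdio>

    int main() {
        const unsigned m_dim = 16384, threads = 18, new_m_dim = 64;
        const unsigned kernels = m_dim / new_m_dim;                         // 256 kernel runs in total
        const unsigned per_thread_min = kernels / threads;                  // 14
        const unsigned per_thread_max = (kernels + threads - 1) / threads;  // 15
        // 14 vs. 15 kernels per thread is at most ~7% imbalance, which supports the
        // expectation above that a small new_m_dim keeps the scheduling acceptable.
        std::printf("%u kernels: %u..%u per thread\n", kernels, per_thread_min, per_thread_max);
        return 0;
    }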

Contributor Author:

Agreed, I reflected this point in ticket 157339.

@IvanNovoselov IvanNovoselov added this pull request to the merge queue Nov 22, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 22, 2024
@mg-intel mg-intel added this pull request to the merge queue Nov 23, 2024
Merged via the queue into openvinotoolkit:master with commit 287ab98 Nov 23, 2024
170 checks passed
NishantPrabhuFujitsu pushed a commit to NishantPrabhuFujitsu/openvino that referenced this pull request Nov 26, 2024
a-sidorova pushed a commit to a-sidorova/openvino that referenced this pull request Nov 27, 2024
Labels
category: CPU OpenVINO CPU plugin