
[GPU] LSTMSequence and LSTMCell optimization #26767

Open · wants to merge 171 commits into base: master
Conversation

@michal-miotk (Contributor) commented Sep 24, 2024

Details:

  • creating a simple primitive for lstm_sequence, to be faster than the previous approach that used many primitives
  • using oneDNN (a rough API-level sketch follows the ticket list below)
  • based on commit c99ddc0 from 25732

Tickets:

  • 146601
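
For context, a rough, self-contained sketch of what a whole-sequence LSTM looks like at the oneDNN API level (oneDNN 3.x naming; the engine kind, data type, and all sizes are illustrative, not taken from this PR):

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    // Illustrative sizes: T = seq_len, N = batch, C = input channels,
    // DHC = hidden size, L = layers, D = directions, G = 4 LSTM gates.
    const memory::dim T = 2, N = 10, C = 10, DHC = 10, L = 1, D = 1, G = 4;

    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    auto md = [](memory::dims dims, memory::format_tag tag) {
        return memory::desc(dims, memory::data_type::f32, tag);
    };
    auto src_layer  = md({T, N, C},           memory::format_tag::tnc);
    auto src_iter   = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto src_iter_c = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto wei_layer  = md({L, D, C, G, DHC},   memory::format_tag::ldigo);
    auto wei_iter   = md({L, D, DHC, G, DHC}, memory::format_tag::ldigo);
    auto bias       = md({L, D, G, DHC},      memory::format_tag::ldgo);
    auto dst_layer  = md({T, N, DHC},         memory::format_tag::tnc);
    auto dst_iter   = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto dst_iter_c = md({L, D, N, DHC},      memory::format_tag::ldnc);

    // A single primitive processes the whole sequence; this is what replaces
    // the old per-timestep decomposition into many cldnn primitives.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        src_layer, src_iter, src_iter_c,
        wei_layer, wei_iter, bias,
        dst_layer, dst_iter, dst_iter_c);
    auto lstm = lstm_forward(pd);
    // Allocate memories with pd.src_layer_desc() etc., then:
    // lstm.execute(strm, {{DNNL_ARG_SRC_LAYER, src_mem}, ...});
    return 0;
}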

commit 232d272f11fbe65e82fa9787260a8b9d34b57d20
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 11:17:47 2024 +0000

    wip

commit e642ca3
Author: michal-miotk <[email protected]>
Date:   Sun Jul 28 22:08:24 2024 +0000

    wip

commit c6b74d3
Author: michal-miotk <[email protected]>
Date:   Fri Jul 26 14:10:26 2024 +0000

    wip

commit 0451429
Author: michal-miotk <[email protected]>
Date:   Thu Jul 25 20:35:11 2024 +0000

    wip3

commit 1164592
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:25:45 2024 +0000

    wip

commit 8b2c049
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:24:02 2024 +0000

    wip

commit 886b412
Author: michal-miotk <[email protected]>
Date:   Mon Aug 5 14:59:14 2024 +0000

    wip

commit 08fb207
Author: michal-miotk <[email protected]>
Date:   Sun Aug 4 20:21:38 2024 +0000

    wip, errors on half

commit 125884d
Author: michal-miotk <[email protected]>
Date:   Sat Aug 3 23:59:58 2024 +0000

    wip

commit af4f209
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 17:58:38 2024 +0000

    wip

commit 12626fc
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 10:52:15 2024 +0000

    wip

commit dfdd052
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 15:38:41 2024 +0000

    wip

commit 54ee912
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 11:01:55 2024 +0000

    only bfyx layout

commit 240fe4a
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 10:34:45 2024 +0000

    two outputs from prim

commit bc775be
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:13:14 2024 +0000

    wip

commit d1cfd60
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:07:06 2024 +0000

    wip

commit 7d18884
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 19:19:04 2024 +0000

    begin of handling reverse

commit 39f64af
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 15:37:06 2024 +0000

    betterbetter

commit 67b3c9a
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 13:12:39 2024 +0000

    better

commit 6ded5aa
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 10:12:31 2024 +0000

    wip

commit 1ccdacc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 23:07:21 2024 +0000

    wip

commit ab1307c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 22:00:50 2024 +0000

    test passed

commit bc65969
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:37:20 2024 +0000

    wip

commit 03cbf57
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:15:06 2024 +0000

    only 2 outputs

commit fd5f3dc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 14:47:12 2024 +0000

    wip

commit 939d23c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 11:34:56 2024 +0000

    wip

commit 2bb561f
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 09:28:03 2024 +0000

    added to binary buffer

commit 1ef83ff
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 22:30:57 2024 +0000

    not works

Resolved review threads (outdated):
  • src/plugins/intel_gpu/src/graph/layout_optimizer.cpp
  • src/plugins/intel_gpu/include/intel_gpu/graph/program.hpp
  • src/plugins/intel_gpu/include/intel_gpu/graph/topology.hpp
  • src/plugins/intel_gpu/src/graph/lstm_seq.cpp (4 threads)
  • src/plugins/intel_gpu/src/graph/program.cpp

struct lstm_elt : public primitive_base<lstm_elt> {
struct lstm_elt : public RNNParams<lstm_elt> {
Contributor:

One of the initial goals of this patch was to remove this plugin-side LSTM decomposition into a bunch of custom primitives (and thus to remove the lstm_elt primitive). That's still needed.

Also, as far as I can see, the lstm_cell primitive is not used at all currently, which means there's no sense in adding it. So my suggestion is to continue with perf tuning, then.

const uint prev_idx = real_seq_length - i;
#else
const uint prev_idx = i-1;
#endif
Contributor:

[random spot] As far as I can see, the bidirectional LSTM sequence works incorrectly.

[ RUN      ] LSTMSequenceCommonZeroClip/LSTMSequenceTest.Inference/mode=PURE_SEQ_seq_lengths=2_batch=10_hidden_size=10_input_size=10_IS=(10.10)(10.10)(10.10)(40.10)(40.10)(40)_activations=(sigmoid.tanh.tanh)_direction=bidirectional_clip=0_WRBType=CONSTANT_modelType=f32_targetDevice=GPU_
Expected: 0.995055 Actual: 0 Coordinate: 20 Diff: 0.995055 calculated_abs_threshold: 9.96247e-05 abs_threshold: 1.19209e-07 rel_threshold: 0.0001
src/tests/functional/shared_test_classes/src/base/ov_subgraph.cpp:97: Failure
[ COMPARATION ] COMPARATION IS FAILED! incorrect elem counter: 1 among 400 shapes.
[  FAILED  ] LSTMSequenceCommonZeroClip/LSTMSequenceTest.Inference/mode=PURE_SEQ_seq_lengths=2_batch=10_hidden_size=10_input_size=10_IS=(10.10)(10.10)(10.10)(40.10)(40.10)(40)_activations=(sigmoid.tanh.tanh)_direction=bidirectional_clip=0_WRBType=CONSTANT_modelType=f32_targetDevice=GPU_, where GetParam() = (PURE_SEQ, 2, 10, 10, 10, { "sigmoid", "tanh", "tanh" }, 0, bidirectional, CONSTANT, f32, "GPU") (52 ms)

(I've manually disabled the bidirectional sequence decomposition and enforced usage of the lstm_seq primitive to get that result.)

Contributor Author:

done

auto mutable_precision_firstsecond = op->get_output_element_type(1);
auto direction = op->get_direction();

if (p.use_new_shape_infer()) {
Contributor:

I suggest enforcing new shape infer for models with an LSTM cell/sequence here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/plugin/program_builder.cpp#L347, and just dropping the legacy shape infer code for this primitive.
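
A hedged sketch of that suggestion, assuming the internal ov::intel_gpu::allow_new_shape_infer property can be set at that point in ProgramBuilder; the op-type scan and the m_config member follow typical plugin code, not this PR:

// Sketch only: enforce new shape infer when the model contains LSTM ops.
bool has_lstm = false;
for (const auto& op : model->get_ordered_ops()) {
    if (ov::is_type<ov::op::v0::LSTMCell>(op) ||
        ov::is_type<ov::op::v4::LSTMCell>(op) ||
        ov::is_type<ov::op::v5::LSTMSequence>(op)) {
        has_lstm = true;
        break;
    }
}
if (has_lstm)
    m_config.set_property(ov::intel_gpu::allow_new_shape_infer(true));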

Contributor Author:

done

Signed-off-by: Michal Miotk <[email protected]>
Signed-off-by: Michal Miotk <[email protected]>
@@ -189,6 +189,9 @@ std::shared_ptr<ov::ICompiledModel> Plugin::compile_model(const std::shared_ptr<

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (context->get_engine().get_device_info().supports_immad) {
Contributor:

These changes in plugin.cpp are not needed, as you have the same logic in apply_user_properties.

Contributor Author:

done

op_mode, 1, axis, num_splits));
p.add_primitive(*op, cldnn::reshape(outputCellID, cldnn::input_info(outputCellCropID),
false, outSzPt, op->get_output_partial_shape(1)));
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], \
Contributor:

  1. I think you can enforce new shape infer for LSTMCell as well.
  2. These out1_prim_id and out2_prim_id are not needed for new shape infer, so you can remove them from the primitive API.

Contributor Author:

done

Contributor:

Is it? I think item 2 is still relevant. You pass this layerName + "_md_write.1" argument, and the corresponding parameters are still there in the primitive API.

std::vector<cldnn::activation_func> activations;
std::vector<cldnn::activation_additional_params> activation_params;
GetLSTMActivationParams(op, activations, activation_params);
float clip = op->get_clip();

assert(!inputs[5].pid.empty());
if (p.use_new_shape_infer()) {
Contributor:

I suggest replacing it with OPENVINO_ASSERT to ensure that the method is called correctly.
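
A minimal sketch of the suggested change (the message strings are illustrative):

#include "openvino/core/except.hpp"

// Unlike plain assert(), which is compiled out in release builds,
// OPENVINO_ASSERT always checks and reports context on failure.
OPENVINO_ASSERT(!inputs[5].pid.empty(), "[GPU] LSTM: missing initial cell state input");
OPENVINO_ASSERT(p.use_new_shape_infer(),
                "[GPU] LSTM conversion expects new shape infer to be enabled");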

Contributor Author:

done

@@ -278,6 +278,9 @@ ov::SupportedOpsMap Plugin::query_model(const std::shared_ptr<const ov::Model>&

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (ctx->get_engine().get_device_info().supports_immad) {
Contributor:

These two changes are not needed either.

Contributor Author:

done

p.add_primitive(*op, cldnn::crop(cellStr, cldnn::input_info(lstm_elt_id), hiddenSz, cellCropSz));
}
const float clip = op->get_clip();
if (op->get_input_shape(2).size() != 3 || op->get_input_shape(3).size() != 1 \
Contributor:

nit: there are also redundant backslashes here and in other places. Please remove them.

Contributor Author:

done

p.add_primitive(*op, cldnn::reshape(layerName + ".out0", concatStr, tensor_from_dims(op->get_output_shape(0))), {layerName});
p.add_primitive(*op, cldnn::reshape(layerName + ".out1", hiddenStr, tensor_from_dims(op->get_output_shape(1))));
p.add_primitive(*op, cldnn::reshape(layerName + ".out2", cellStr, tensor_from_dims(op->get_output_shape(2))));
if (p.use_new_shape_infer()) {
Contributor:

OPENVINO_ASSERT here as well

Contributor Author:

done

public:
using parent::parent;

program_node& input() const { return get_dependency(0); }
Contributor:

Likely the same unused methods as in the lstm_seq primitive.

Contributor Author:

done

Comment on lines 30 to 44
std::vector<format::type> in_fmts(node.get_dependencies().size(), format::any);
std::vector<format::type> out_fmts(node.get_outputs_count(), format::any);

size_t out_rank = node.get_output_layout().get_rank();
for (size_t idx = 0; idx < node.get_dependencies().size(); idx++) {
    if (node.get_dependency(idx).is_constant())
        continue;

    auto target_format = format::get_default_format(out_rank);

    in_fmts[idx] = target_format;
}
out_fmts[0] = format::ybfx;

return {in_fmts, out_fmts};
Contributor:

I think that code should actually query oneDNN for the required tensor formats (as it's done for convolutions). You can do it in the next PR.
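
A hedged sketch of that query with the oneDNN 3.x API: pass memory::format_tag::any where the RNN primitive allows it (typically the weights), then read back the descriptors oneDNN selected. All sizes are illustrative, and mapping the returned memory::desc to a cldnn::format is left out:

#include <oneapi/dnnl/dnnl.hpp>

dnnl::memory::desc query_weights_layer_format(const dnnl::engine& eng) {
    using namespace dnnl;
    const memory::dim T = 2, N = 1, C = 16, DHC = 16, L = 1, D = 1, G = 4;
    auto md = [](memory::dims dims, memory::format_tag tag) {
        return memory::desc(dims, memory::data_type::f16, tag);
    };
    // format_tag::any lets the implementation pick the layout it prefers.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        md({T, N, C}, memory::format_tag::tnc),            // src_layer
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // src_iter
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // src_iter_c
        md({L, D, C, G, DHC}, memory::format_tag::any),    // weights_layer
        md({L, D, DHC, G, DHC}, memory::format_tag::any),  // weights_iter
        md({L, D, G, DHC}, memory::format_tag::ldgo),      // bias
        md({T, N, DHC}, memory::format_tag::tnc),          // dst_layer
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // dst_iter
        md({L, D, N, DHC}, memory::format_tag::ldnc));     // dst_iter_c
    // The chosen layout would then be mapped back to a cldnn::format.
    return pd.weights_layer_desc();
}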

Contributor Author:

ok

Comment on lines 24 to 25
return node.get_input_layout(0).format == cldnn::format::bfyx || node.get_input_layout(0).format == cldnn::format::fbyx \
       || node.get_input_layout(0).format == cldnn::format::ybfx;
Contributor:

I think tensor format is not the only restriction. At least we need:

  1. type checks
  2. a check for info.arch == gpu_arch::unknown (see other impls)
  3. padding checks

A rough sketch follows below.
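
Something like this, modeled on other onednn impls in the plugin; the exact member names and the padding predicate are assumptions, not verbatim code:

// Sketch only: extended validation for the onednn lstm_seq impl.
static bool is_supported(const program_node& node, const device_info& info) {
    // 2. onednn impls are restricted to known Intel GPU architectures
    if (info.arch == gpu_arch::unknown)
        return false;

    const auto& in_layout = node.get_input_layout(0);

    // 1. type checks: only types validated with the onednn RNN primitive
    if (in_layout.data_type != data_types::f16 && in_layout.data_type != data_types::f32)
        return false;

    // 3. padding checks: the onednn memory descriptors assume dense tensors
    if (in_layout.data_padding)
        return false;

    // original format restriction
    return in_layout.format == cldnn::format::bfyx ||
           in_layout.format == cldnn::format::fbyx ||
           in_layout.format == cldnn::format::ybfx;
}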

Contributor Author:

1 and 2 are done; 3 is not done.

Comment on lines 35 to 59
    int i = 0;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_LAYER, mem});
}

{
    int i = 1;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_ITER, mem});
}

{
    int i = 2;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_ITER_C, mem});
}
Contributor:

I think this code can be done in a loop if you store these DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, etc. in a vector. The same goes for the weights and dst buffers.
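
A sketch of the loop version, built directly from the snippet above:

// The three source arguments differ only by input index and DNNL arg id,
// so a small table can drive the loop; the same pattern extends to the
// weights and dst buffers.
const std::vector<int> src_args = {DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, DNNL_ARG_SRC_ITER_C};
for (int i = 0; i < static_cast<int>(src_args.size()); i++) {
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({src_args[i], mem});
}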

Contributor Author:

done

Comment on lines 220 to 232
auto hiddenSize = reorder_params->get_output_layout().get_shape()[1] / 4;
auto cropSize = cldnn::tensor{dir_num, static_cast<int>(hiddenSize), 1, 1};
std::string crop_id_b = input_id + "_c";
auto get_crop_node = [&](int cropNum) -> cldnn::program_node& {
    auto crop_id = primitive_id(crop_id_b + std::to_string(cropNum));
    auto crop_prim = std::make_shared<cldnn::crop>(crop_id, input_id, cropSize,
                                                   cldnn::tensor{0, static_cast<int>(cropNum * hiddenSize), 0, 0});
    return p.get_or_create(crop_prim);
};
auto& crop0_node = get_crop_node(0);
auto& crop1_node = get_crop_node(1);
auto& crop2_node = get_crop_node(2);
auto& crop3_node = get_crop_node(3);
std::vector<input_info> con_input{input_info(crop1_node.id()), input_info(crop0_node.id()),
                                  input_info(crop2_node.id()), input_info(crop3_node.id())};
Contributor:

Can it be done with some kind of Slice/StridedSlice primitive?

Contributor Author:

It can be. Actually, I've deleted one crop, but I don't think it would be easy to end up with fewer nodes using a StridedSlice primitive.

Labels: category: build (OpenVINO cmake script / infra), category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common)

5 participants