
[GPU] LSTMSequence and LSTMCell optimization #26767

Open · wants to merge 171 commits into base: master
Conversation

@michal-miotk (Contributor) commented Sep 24, 2024

Details:

  • creating a simple primitive for lstm_sequence, to be faster than the previous approach that used many primitives
  • using oneDNN (a rough API-level sketch follows the ticket list below)
  • based on commit c99ddc0 from 25732

Tickets:

  • 146601
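
For context, a rough, self-contained sketch of what a whole-sequence LSTM looks like at the oneDNN API level (oneDNN 3.x naming; the engine kind, data type, and all sizes are illustrative, not taken from this PR):

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    // Illustrative sizes: T = seq_len, N = batch, C = input channels,
    // DHC = hidden size, L = layers, D = directions, G = 4 LSTM gates.
    const memory::dim T = 2, N = 10, C = 10, DHC = 10, L = 1, D = 1, G = 4;

    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    auto md = [](memory::dims dims, memory::format_tag tag) {
        return memory::desc(dims, memory::data_type::f32, tag);
    };
    auto src_layer  = md({T, N, C},           memory::format_tag::tnc);
    auto src_iter   = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto src_iter_c = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto wei_layer  = md({L, D, C, G, DHC},   memory::format_tag::ldigo);
    auto wei_iter   = md({L, D, DHC, G, DHC}, memory::format_tag::ldigo);
    auto bias       = md({L, D, G, DHC},      memory::format_tag::ldgo);
    auto dst_layer  = md({T, N, DHC},         memory::format_tag::tnc);
    auto dst_iter   = md({L, D, N, DHC},      memory::format_tag::ldnc);
    auto dst_iter_c = md({L, D, N, DHC},      memory::format_tag::ldnc);

    // A single primitive processes the whole sequence; this is what replaces
    // the old per-timestep decomposition into many cldnn primitives.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        src_layer, src_iter, src_iter_c,
        wei_layer, wei_iter, bias,
        dst_layer, dst_iter, dst_iter_c);
    auto lstm = lstm_forward(pd);
    // Allocate memories with pd.src_layer_desc() etc., then:
    // lstm.execute(strm, {{DNNL_ARG_SRC_LAYER, src_mem}, ...});
    return 0;
}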

commit 232d272f11fbe65e82fa9787260a8b9d34b57d20
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 11:17:47 2024 +0000

    wip

commit e642ca3
Author: michal-miotk <[email protected]>
Date:   Sun Jul 28 22:08:24 2024 +0000

    wip

commit c6b74d3
Author: michal-miotk <[email protected]>
Date:   Fri Jul 26 14:10:26 2024 +0000

    wip

commit 0451429
Author: michal-miotk <[email protected]>
Date:   Thu Jul 25 20:35:11 2024 +0000

    wip3

commit 1164592
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:25:45 2024 +0000

    wip

commit 8b2c049
Author: michal-miotk <[email protected]>
Date:   Tue Aug 6 09:24:02 2024 +0000

    wip

commit 886b412
Author: michal-miotk <[email protected]>
Date:   Mon Aug 5 14:59:14 2024 +0000

    wip

commit 08fb207
Author: michal-miotk <[email protected]>
Date:   Sun Aug 4 20:21:38 2024 +0000

    wip, errors on half

commit 125884d
Author: michal-miotk <[email protected]>
Date:   Sat Aug 3 23:59:58 2024 +0000

    wip

commit af4f209
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 17:58:38 2024 +0000

    wip

commit 12626fc
Author: michal-miotk <[email protected]>
Date:   Fri Aug 2 10:52:15 2024 +0000

    wip

commit dfdd052
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 15:38:41 2024 +0000

    wip

commit 54ee912
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 11:01:55 2024 +0000

    only bfyx layout

commit 240fe4a
Author: michal-miotk <[email protected]>
Date:   Thu Aug 1 10:34:45 2024 +0000

    two outputs from prim

commit bc775be
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:13:14 2024 +0000

    wip

commit d1cfd60
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 22:07:06 2024 +0000

    wip

commit 7d18884
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 19:19:04 2024 +0000

    begin of handling reverse

commit 39f64af
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 15:37:06 2024 +0000

    betterbetter

commit 67b3c9a
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 13:12:39 2024 +0000

    better

commit 6ded5aa
Author: michal-miotk <[email protected]>
Date:   Wed Jul 31 10:12:31 2024 +0000

    wip

commit 1ccdacc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 23:07:21 2024 +0000

    wip

commit ab1307c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 22:00:50 2024 +0000

    test passed

commit bc65969
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:37:20 2024 +0000

    wip

commit 03cbf57
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 15:15:06 2024 +0000

    only 2 outputs

commit fd5f3dc
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 14:47:12 2024 +0000

    wip

commit 939d23c
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 11:34:56 2024 +0000

    wip

commit 2bb561f
Author: michal-miotk <[email protected]>
Date:   Tue Jul 30 09:28:03 2024 +0000

    added to binary buffer

commit 1ef83ff
Author: michal-miotk <[email protected]>
Date:   Mon Jul 29 22:30:57 2024 +0000

    not works

Resolved review threads (outdated):
  • src/plugins/intel_gpu/src/graph/layout_optimizer.cpp
  • src/plugins/intel_gpu/include/intel_gpu/graph/program.hpp
  • src/plugins/intel_gpu/include/intel_gpu/graph/topology.hpp
  • src/plugins/intel_gpu/src/graph/lstm_seq.cpp (4 threads)
  • src/plugins/intel_gpu/src/graph/program.cpp

struct lstm_elt : public primitive_base<lstm_elt> {
struct lstm_elt : public RNNParams<lstm_elt> {
Contributor:

One of the initial goals of this patch was to remove this plugin-side LSTM decomposition into a bunch of custom primitives (and thus to remove the lstm_elt primitive). That's still needed.

Also, as far as I can see, the lstm_cell primitive is not used at all currently, which means there's no sense in adding it. So my suggestion is to continue with perf tuning, then.

const uint prev_idx = real_seq_length - i;
#else
const uint prev_idx = i-1;
#endif
Contributor:

[random spot] As far as I can see, the bidirectional LSTM sequence works incorrectly.

[ RUN      ] LSTMSequenceCommonZeroClip/LSTMSequenceTest.Inference/mode=PURE_SEQ_seq_lengths=2_batch=10_hidden_size=10_input_size=10_IS=(10.10)(10.10)(10.10)(40.10)(40.10)(40)_activations=(sigmoid.tanh.tanh)_direction=bidirectional_clip=0_WRBType=CONSTANT_modelType=f32_targetDevice=GPU_
Expected: 0.995055 Actual: 0 Coordinate: 20 Diff: 0.995055 calculated_abs_threshold: 9.96247e-05 abs_threshold: 1.19209e-07 rel_threshold: 0.0001
src/tests/functional/shared_test_classes/src/base/ov_subgraph.cpp:97: Failure
[ COMPARATION ] COMPARATION IS FAILED! incorrect elem counter: 1 among 400 shapes.
[  FAILED  ] LSTMSequenceCommonZeroClip/LSTMSequenceTest.Inference/mode=PURE_SEQ_seq_lengths=2_batch=10_hidden_size=10_input_size=10_IS=(10.10)(10.10)(10.10)(40.10)(40.10)(40)_activations=(sigmoid.tanh.tanh)_direction=bidirectional_clip=0_WRBType=CONSTANT_modelType=f32_targetDevice=GPU_, where GetParam() = (PURE_SEQ, 2, 10, 10, 10, { "sigmoid", "tanh", "tanh" }, 0, bidirectional, CONSTANT, f32, "GPU") (52 ms)

(I've manually disabled the bidirectional sequence decomposition and enforced usage of the lstm_seq primitive to get that result.)

Contributor Author:

done

auto mutable_precision_firstsecond = op->get_output_element_type(1);
auto direction = op->get_direction();

if (p.use_new_shape_infer()) {
Contributor:

I suggest enforcing new shape infer for models with an LSTM cell/sequence here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_gpu/src/plugin/program_builder.cpp#L347, and just dropping the legacy shape infer code for this primitive.
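
A hedged sketch of that suggestion, assuming the internal ov::intel_gpu::allow_new_shape_infer property can be set at that point in ProgramBuilder; the op-type scan and the m_config member follow typical plugin code, not this PR:

// Sketch only: enforce new shape infer when the model contains LSTM ops.
bool has_lstm = false;
for (const auto& op : model->get_ordered_ops()) {
    if (ov::is_type<ov::op::v0::LSTMCell>(op) ||
        ov::is_type<ov::op::v4::LSTMCell>(op) ||
        ov::is_type<ov::op::v5::LSTMSequence>(op)) {
        has_lstm = true;
        break;
    }
}
if (has_lstm)
    m_config.set_property(ov::intel_gpu::allow_new_shape_infer(true));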

Contributor Author:

done

Signed-off-by: Michal Miotk <[email protected]>
Signed-off-by: Michal Miotk <[email protected]>
@@ -189,6 +189,9 @@ std::shared_ptr<ov::ICompiledModel> Plugin::compile_model(const std::shared_ptr<

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (context->get_engine().get_device_info().supports_immad) {
Contributor:

These changes in plugin.cpp are not needed, as you have the same logic in apply_user_properties.

Contributor Author:

done

op_mode, 1, axis, num_splits));
p.add_primitive(*op, cldnn::reshape(outputCellID, cldnn::input_info(outputCellCropID),
false, outSzPt, op->get_output_partial_shape(1)));
p.add_primitive(*op, cldnn::lstm_cell(layerName+".out0", inputs[0], inputs[1], inputs[2], inputs[3], inputs[4], inputs[5], \
Contributor:

  1. I think you can enforce new shape infer for LSTMCell as well.
  2. These out1_prim_id and out2_prim_id are not needed for new shape infer, so you can remove them from the primitive API.

Contributor Author:

done

Contributor:

Is it? I think item 2 is still relevant. You pass this layerName + "_md_write.1" argument, and the corresponding parameters are still there in the primitive API.

std::vector<cldnn::activation_func> activations;
std::vector<cldnn::activation_additional_params> activation_params;
GetLSTMActivationParams(op, activations, activation_params);
float clip = op->get_clip();

assert(!inputs[5].pid.empty());
if (p.use_new_shape_infer()) {
Contributor:

I suggest replacing it with OPENVINO_ASSERT to ensure that the method is called correctly.
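
A minimal sketch of the suggested change (the message strings are illustrative):

#include "openvino/core/except.hpp"

// Unlike plain assert(), which is compiled out in release builds,
// OPENVINO_ASSERT always checks and reports context on failure.
OPENVINO_ASSERT(!inputs[5].pid.empty(), "[GPU] LSTM: missing initial cell state input");
OPENVINO_ASSERT(p.use_new_shape_infer(),
                "[GPU] LSTM conversion expects new shape infer to be enabled");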

Contributor Author:

done

@@ -278,6 +278,9 @@ ov::SupportedOpsMap Plugin::query_model(const std::shared_ptr<const ov::Model>&

ExecutionConfig config = m_configs_map.at(device_id);
config.set_user_property(orig_config);
if (ctx->get_engine().get_device_info().supports_immad) {
Contributor:

These two changes are not needed either.

Contributor Author:

done

p.add_primitive(*op, cldnn::crop(cellStr, cldnn::input_info(lstm_elt_id), hiddenSz, cellCropSz));
}
const float clip = op->get_clip();
if (op->get_input_shape(2).size() != 3 || op->get_input_shape(3).size() != 1 \
Contributor:

nit: there are also redundant backslashes here and in other places. Please remove them.

Contributor Author:

done

p.add_primitive(*op, cldnn::reshape(layerName + ".out0", concatStr, tensor_from_dims(op->get_output_shape(0))), {layerName});
p.add_primitive(*op, cldnn::reshape(layerName + ".out1", hiddenStr, tensor_from_dims(op->get_output_shape(1))));
p.add_primitive(*op, cldnn::reshape(layerName + ".out2", cellStr, tensor_from_dims(op->get_output_shape(2))));
if (p.use_new_shape_infer()) {
Contributor:

OPENVINO_ASSERT here as well

Contributor Author:

done

public:
using parent::parent;

program_node& input() const { return get_dependency(0); }
Contributor:

Likely the same unused methods as in the lstm_seq primitive.

Contributor Author:

done

Comment on lines 30 to 44
std::vector<format::type> in_fmts(node.get_dependencies().size(), format::any);
std::vector<format::type> out_fmts(node.get_outputs_count(), format::any);

size_t out_rank = node.get_output_layout().get_rank();
for (size_t idx = 0; idx < node.get_dependencies().size(); idx++) {
    if (node.get_dependency(idx).is_constant())
        continue;

    auto target_format = format::get_default_format(out_rank);

    in_fmts[idx] = target_format;
}
out_fmts[0] = format::ybfx;

return {in_fmts, out_fmts};
Contributor:

I think that code should actually query oneDNN for the required tensor formats (as it's done for convolutions). You can do it in the next PR.
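
A hedged sketch of that query with the oneDNN 3.x API: pass memory::format_tag::any where the RNN primitive allows it (typically the weights), then read back the descriptors oneDNN selected. All sizes are illustrative, and mapping the returned memory::desc to a cldnn::format is left out:

#include <oneapi/dnnl/dnnl.hpp>

dnnl::memory::desc query_weights_layer_format(const dnnl::engine& eng) {
    using namespace dnnl;
    const memory::dim T = 2, N = 1, C = 16, DHC = 16, L = 1, D = 1, G = 4;
    auto md = [](memory::dims dims, memory::format_tag tag) {
        return memory::desc(dims, memory::data_type::f16, tag);
    };
    // format_tag::any lets the implementation pick the layout it prefers.
    auto pd = lstm_forward::primitive_desc(eng, prop_kind::forward_inference,
        rnn_direction::unidirectional_left2right,
        md({T, N, C}, memory::format_tag::tnc),            // src_layer
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // src_iter
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // src_iter_c
        md({L, D, C, G, DHC}, memory::format_tag::any),    // weights_layer
        md({L, D, DHC, G, DHC}, memory::format_tag::any),  // weights_iter
        md({L, D, G, DHC}, memory::format_tag::ldgo),      // bias
        md({T, N, DHC}, memory::format_tag::tnc),          // dst_layer
        md({L, D, N, DHC}, memory::format_tag::ldnc),      // dst_iter
        md({L, D, N, DHC}, memory::format_tag::ldnc));     // dst_iter_c
    // The chosen layout would then be mapped back to a cldnn::format.
    return pd.weights_layer_desc();
}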

Contributor Author:

ok

Comment on lines 24 to 25
return node.get_input_layout(0).format == cldnn::format::bfyx || node.get_input_layout(0).format == cldnn::format::fbyx \
       || node.get_input_layout(0).format == cldnn::format::ybfx;
Contributor:

I think tensor format is not the only restriction. At least we need:

  1. type checks
  2. a check for info.arch == gpu_arch::unknown (see other impls)
  3. padding checks

A rough sketch follows below.
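
Something like this, modeled on other onednn impls in the plugin; the exact member names and the padding predicate are assumptions, not verbatim code:

// Sketch only: extended validation for the onednn lstm_seq impl.
static bool is_supported(const program_node& node, const device_info& info) {
    // 2. onednn impls are restricted to known Intel GPU architectures
    if (info.arch == gpu_arch::unknown)
        return false;

    const auto& in_layout = node.get_input_layout(0);

    // 1. type checks: only types validated with the onednn RNN primitive
    if (in_layout.data_type != data_types::f16 && in_layout.data_type != data_types::f32)
        return false;

    // 3. padding checks: the onednn memory descriptors assume dense tensors
    if (in_layout.data_padding)
        return false;

    // original format restriction
    return in_layout.format == cldnn::format::bfyx ||
           in_layout.format == cldnn::format::fbyx ||
           in_layout.format == cldnn::format::ybfx;
}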

Contributor Author:

1 and 2 are done; 3 is not done.

Comment on lines 35 to 59
    int i = 0;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_LAYER, mem});
}

{
    int i = 1;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_ITER, mem});
}

{
    int i = 2;
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({DNNL_ARG_SRC_ITER_C, mem});
}
Contributor:

I think this code can be done in a loop if you store these DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, etc. in a vector. The same goes for the weights and dst buffers.
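
A sketch of the loop version, built directly from the snippet above:

// The three source arguments differ only by input index and DNNL arg id,
// so a small table can drive the loop; the same pattern extends to the
// weights and dst buffers.
const std::vector<int> src_args = {DNNL_ARG_SRC_LAYER, DNNL_ARG_SRC_ITER, DNNL_ARG_SRC_ITER_C};
for (int i = 0; i < static_cast<int>(src_args.size()); i++) {
    auto& input = instance.input_memory(i);
    auto offset = onednn::get_offset(instance.get_input_layout(i),
                                     _pd.dnnl::primitive_desc_base::src_desc(i));
    auto mem = input.get_onednn_memory(_pd.dnnl::primitive_desc_base::src_desc(i), offset);
    args.insert({src_args[i], mem});
}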

Contributor Author:

done

Comment on lines 220 to 232
auto hiddenSize = reorder_params->get_output_layout().get_shape()[1] / 4;
auto cropSize = cldnn::tensor{dir_num, static_cast<int>(hiddenSize), 1, 1};
std::string crop_id_b = input_id + "_c";
auto get_crop_node = [&](int cropNum) -> cldnn::program_node& {
    auto crop_id = primitive_id(crop_id_b + std::to_string(cropNum));
    auto crop_prim = std::make_shared<cldnn::crop>(crop_id, input_id, cropSize,
                                                   cldnn::tensor{0, static_cast<int>(cropNum * hiddenSize), 0, 0});
    return p.get_or_create(crop_prim);
};
auto& crop0_node = get_crop_node(0);
auto& crop1_node = get_crop_node(1);
auto& crop2_node = get_crop_node(2);
auto& crop3_node = get_crop_node(3);
std::vector<input_info> con_input{input_info(crop1_node.id()), input_info(crop0_node.id()),
                                  input_info(crop2_node.id()), input_info(crop3_node.id())};
Contributor:

Can it be done with some kind of Slice/StridedSlice primitive?

Contributor Author:

It can be. Actually, I've deleted one crop, but I don't think it would be easy to end up with fewer nodes using a StridedSlice primitive.

Labels: category: build (OpenVINO cmake script / infra), category: GPU (OpenVINO GPU plugin), category: IE Tests (OpenVINO Test: plugins and common)

5 participants