Parallel sampling with ov::threading #1233

mzegla · 2024-11-19T15:38:44Z

This PR implements the same functionality as #1252, but in a different way. Only one of the two should be merged.

Since the pipeline logic runs on a single thread, CPU utilization drops whenever the pipeline is not executing inference but other logic, such as sampling, which can take a significant fraction of the time. Currently, after inference completes, we sample from each sequence group in a loop on a single thread. This becomes a bottleneck with sampling parameters that significantly extend the sampling time of a single sequence group.

This PR extracts the sampling logic for a single sequence group into a separate method that can be executed independently of any other sequence group. It also includes a generic thread pool implementation that spawns a given number of threads, which are used to run the sampling logic for different sequence groups in parallel.
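A minimal sketch of what "executed independently of any other sequence group" allows: each call reads only its own group and returns its own result, so calls for different groups can run concurrently without locks. `SequenceGroup`, `SampleResult`, `sample_group` and `sample_all` are hypothetical simplifications, not the openvino.genai types. `std::async` with `std::launch::async` stands in for the thread pool here; it behaves as if each group got a new thread rather than reusing a fixed set of workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <future>
#include <limits>
#include <random>
#include <vector>

// Hypothetical, simplified stand-ins; not the actual openvino.genai types.
struct SequenceGroup {
    std::vector<float> logits;  // next-token logits; at least one must be finite
    unsigned seed;              // per-group seed, so the result does not depend on scheduling
};

struct SampleResult {
    std::size_t token_id;
};

// Multinomial sampling for ONE group. It reads only `group` and returns a
// fresh result, so calls for different groups share no mutable state.
SampleResult sample_group(const SequenceGroup& group) {
    double max_logit = -std::numeric_limits<double>::infinity();
    for (float l : group.logits) max_logit = std::max(max_logit, static_cast<double>(l));

    // Softmax numerators; discrete_distribution normalizes the weights itself.
    std::vector<double> weights;
    weights.reserve(group.logits.size());
    for (float l : group.logits) weights.push_back(std::exp(l - max_logit));

    std::mt19937 rng(group.seed);  // local RNG: no generator shared between threads
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return {dist(rng)};
}

// Samples every group concurrently and returns the results in group order.
std::vector<SampleResult> sample_all(const std::vector<SequenceGroup>& groups) {
    std::vector<std::future<SampleResult>> pending;
    pending.reserve(groups.size());
    for (const SequenceGroup& group : groups) {
        pending.push_back(std::async(std::launch::async, sample_group, std::cref(group)));
    }
    std::vector<SampleResult> results;
    results.reserve(pending.size());
    for (auto& f : pending) results.push_back(f.get());  // blocks until that group is done
    return results;
}
```

Because each task owns a per-group seed and a local `std::mt19937`, the sampled tokens do not depend on which thread happens to run which group.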

Performance measurements confirm an improvement, especially for non-greedy sampling and at high concurrency: the more sequence groups are scheduled for inference, the greater the benefit from parallel sampling.

ilya-lavrenov · 2024-11-19T17:38:54Z

src/cpp/src/threadpool.hpp

+    bool stop = false;
+
+public:
+    ThreadPool(size_t num_threads = std::thread::hardware_concurrency())
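For context on the fragment above, a minimal fixed-size thread pool in the same spirit might look like the following. It is a sketch under assumptions, not the contents of threadpool.hpp: only the `stop` flag and the `num_threads` default come from the diff, while `enqueue` and the worker loop are hypothetical. Note that `std::thread::hardware_concurrency()` may return 0 when the value is unknown, so the sketch always starts at least one worker.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size thread pool (hypothetical sketch).
class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mutex;
    std::condition_variable cv;
    bool stop = false;  // guarded by `mutex`; set once by the destructor

public:
    explicit ThreadPool(std::size_t num_threads = std::thread::hardware_concurrency()) {
        num_threads = std::max<std::size_t>(num_threads, 1);  // 0 means "unknown"
        for (std::size_t i = 0; i < num_threads; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mutex);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;  // queue drained: exit
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // run outside the lock
                }
            });
        }
    }

    // Queues `f` and returns a future for its result.
    template <typename F>
    auto enqueue(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        // packaged_task is move-only; a shared_ptr makes it storable in std::function.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex);
            if (stop) throw std::runtime_error("enqueue on stopped ThreadPool");
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return result;
    }

    // Lets workers finish queued tasks, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stop = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};
```

Reusing a fixed set of workers avoids creating and destroying threads on every generation step, which is the main reason to prefer a pool over launching one thread per sequence group.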


Typically, in OV we optimize with parallel_for or similar functions: https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/core/parallel.hpp

mzegla · 2024-11-20

This thread pool is used in a loop where the next iteration uses values computed in the previous one, and a task scheduled on another thread needs those values. I'm not familiar with TBB development aspects, so correct me if I'm wrong, but isn't such a scenario an issue for parallel_for? Doesn't each iteration need to be completely independent of the others?
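One way to read the question above: `parallel_for` needs independence only among the iterations of the loop it parallelizes, which here would be the per-step loop over sequence groups, not the outer generation loop. In the sketch below the outer loop stays sequential, and because the parallel region returns only after all its iterations finish (here via `std::thread::join`), values written in step k are visible in step k+1. The `parallel_for` helper is a hypothetical stand-in written with `std::thread` so the example compiles without OpenVINO; it does not reproduce the signature or scheduling of the OpenVINO function. `GroupState`, `sample_next` and `generate` are likewise illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a library parallel_for: runs body(i) for i in [0, n)
// on a few std::threads and returns only after all of them have finished.
template <typename Body>
void parallel_for(std::size_t n, const Body& body) {
    std::size_t num_threads = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    num_threads = std::min(num_threads, n);  // n == 0 starts no threads
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += num_threads) body(i);  // strided split
        });
    }
    for (auto& th : threads) th.join();  // barrier: every iteration is done on return
}

// Hypothetical per-group state carried from one generation step to the next.
struct GroupState {
    std::vector<int> tokens;  // tokens generated so far
};

// Toy "sampler": the next token depends only on this group's previous token.
int sample_next(const GroupState& s, std::size_t group_index) {
    int last = s.tokens.empty() ? static_cast<int>(group_index) : s.tokens.back();
    return last + 1;
}

void generate(std::vector<GroupState>& groups, int steps) {
    for (int step = 0; step < steps; ++step) {  // sequential: step k+1 needs step k's output
        // ... inference for all groups would run here ...
        parallel_for(groups.size(), [&](std::size_t i) {
            // Independent within a step: iteration i touches only groups[i].
            groups[i].tokens.push_back(sample_next(groups[i], i));
        });
    }
}
```

State shared between groups inside the parallel region would break this independence, so it would have to be removed or synchronized.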


github-actions bot added the "category: continuous batching" and "category: sampling" labels Nov 19, 2024

ilya-lavrenov reviewed Nov 19, 2024


mzegla added 2 commits November 20, 2024 13:40

extract sampling for single sequence group and call it asynchronously

7d5dfb3

post rebase adjustments, fix finish iteration, move currently_processed_tokens update, switch to async, experimental threadpool, remove access to shared struct in parallelized code, synchronize beam search part

refactor

52d391a

mzegla force-pushed the parallel_sampling_poc branch from 589365b to 52d391a November 20, 2024 12:41

github-actions bot added the "no-match-files" and "category: cmake / build" labels Nov 20, 2024

mzegla force-pushed the parallel_sampling_poc branch 4 times, most recently from 72d1af7 to 109c7e1 November 20, 2024 15:58

use tbb instead of threadpool

59a4e6d

mzegla force-pushed the parallel_sampling_poc branch 4 times, most recently from 89b54bd to d542605 November 21, 2024 13:15

use ov threading

eec70e5

mzegla force-pushed the parallel_sampling_poc branch from d542605 to eec70e5 November 21, 2024 13:35

mzegla changed the title from "Parallel sampling" to "Parallel sampling with ov::threading" Nov 25, 2024

mzegla mentioned this pull request Nov 25, 2024

Parallel sampling with threadpool #1252 (open)

ilya-lavrenov self-assigned this Nov 26, 2024
