[GPU] Global Buffer manager and optimization #2816

s-debadri · 2024-12-04T10:09:35Z

Optimized GPU pipeline by managing Buffer creations globally:

Added singleton class ClBufferManager for global Buffer objects.
Created 5 buffers of 256 mebibytes globally which can be used by all the kernels multiple times.
Added wrappers for OpenCL APIs clEnqueueReadBufferRect and clEnqueueWriteBufferRect.
Implemented WriteDataRegion and ReadDataRegion members of Buffer class to read and write to a particular region of device buffer.
Lazy initialization of buffers at cl_context.
Tested with BLAS kernels.

These changes optimizes the current GPU pipeline by managing device buffers optimally.

Note: Device buffer preset size can be modified later if required.

Self evaluation:

Build test: [X]Passed [ ]Failed [ ]Skipped
Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Debadri Samaddar [email protected]

Implemented global Buffers Optimized pipeline due to reduced buffer creation steps Modifed command queue and Buffer wrappers Signed-off-by: Debadri Samaddar <[email protected]>

Initialize buffer objects after command queue creation Signed-off-by: Debadri Samaddar <[email protected]>

EunjuYang

This PR updates to create clBuffers in advance, which is related to #2812. Thank you for the hard work. Please check some opinions on this PR from my side:

EunjuYang · 2024-12-09T03:52:01Z

nntrainer/cl_buffer_manager.h

+  /**
+   * @brief Buffer size in bytes preset (256 mebibytes)
+   */
+  size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);


If it's a fixed constant, what about adding const ?

Suggested change

size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);

const size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);

EunjuYang · 2024-12-09T03:59:51Z

nntrainer/cl_buffer_manager.h

+  opencl::Buffer *readBufferA;
+  opencl::Buffer *readBufferB;
+  opencl::Buffer *readBufferC;
+  opencl::Buffer *writeBufferA;
+  opencl::Buffer *writeBufferB;


Is it reasonable to let them public? From the name of this class, cl_buffer_manager, it would be better not to expose the right managing the buffers outside. Isn't it better to make them private and implement some methods to access them?

One more comment.
It sounds confusing (read / write - only holds from GPU perspective). What about changing kernelInBuffer / OutBuffer? I'm not sure my suggestion is the best option... i find we can find better names making them more claer :)

Added abstraction in cl_buffer_manager in the latest commit.

EunjuYang · 2024-12-09T04:04:31Z

nntrainer/cl_buffer_manager.cpp

+void ClBufferManager::initBuffers() {
+  readBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
+  readBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
+  readBufferC = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
+  writeBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
+  writeBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
+  ml_logi("ClBufferManager: Buffers initialized");
+}


Simple question! Based on this PR, do we have to create all buffers for every kernel in advance?

Buffer creation consumes around 70-75% of the whole latency. This PR will create buffers only once at the beginning. All kernels will be able to re-use same or different buffers multiple times. Which means, buffer data update can happen multiple times but creation will happen only once.

I understood the concept. The point I want to ask is how may buffers should be created in advance. Also, as you mentioned, the buffer can be reused. Then, the manager should schedule the proper buffer by hiding the internal buffer assets.

As per the PR, 5 buffers are created in advance each of 256 MiB. Also, as you suggested before I have added proper abstraction for cl_buffer_manager.

jijoongmoon

LGTM

EunjuYang

Overall updates look good :)

I have one question about the performance.
I run the unittest_blas_kernels_cl to check its speedup; the result was interesting.

This PR shows regularly slow (545 ms total in avg.).
The previous version showed slow for the first run (754 ms) and (243 ms in avg.) for remaining trials.
Is it because the buffer manager create all buffers for every process in advance? I don't know why the previous version showed better performance except for the first run.

s-debadri · 2024-12-18T05:07:22Z

Overall updates look good :)

I have one question about the performance. I run the unittest_blas_kernels_cl to check its speedup; the result was interesting.

This PR shows regularly slow (545 ms total in avg.). The previous version showed slow for the first run (754 ms) and (243 ms in avg.) for remaining trials. Is it because the buffer manager create all buffers for every process in advance? I don't know why the previous version showed better performance except for the first run.

This PR contains new buffer implementation with only sgemv kernel as of now. So other unit tests were still creating additional buffers with this new PR. Which means extra buffers were being created with this new PR since all the BLAS kernels are not updated.

For proper calculation of latency, you can see the difference if you run the same kernel (sgemv for this PR) multiple times.

Note: This PR mainly contains the buffer manager and is tested with sgemv. New changes for other blas kernels will be done once this PR gets merged otherwise one PR will have too many commits.

myungjoo · 2024-12-20T09:10:51Z

nntrainer/cl_context.h

@@ -272,6 +275,9 @@ class ClContext {

    // getContext() called inside createCommandQueue which creates clContext
    bool result = command_queue_inst_.CreateCommandQueue();
+    // initialize device buffers
+    clbuffInstance.initBuffers();


Can you gurantee that initBuffers() is called before any possible call to ClBufferManager::some-access-funtion?
Is there a reason not to do it at the constructor of ClBufferManager?

Yes initBuffers() will be called at the beginning when context is initialized before any ClBufferManager member is accessed.
The main reason of not doing it inside the constructor is to enforce lazy loading of the buffers. Otherwise buffers might get initialized before OpenCL context and command queue is initialized which might result in undefined behaviour when trying to read/write into the buffers. To avoid such ambiguity, it was removed from the constructor.

To guarantee the calling sequence (other memer functions are called after initBuffers()), it should not depend on human programmer's plans, comments, documents, or promises. It is still prone to human errors (especially with open source contributors). It should be guaranteed by the code so that another developer who doesn't know about this in the open source community won't be able to reverse the sequence by mistake, thinking that this object is initialized already with getInstance or constructor.

At least, set the buffer pointers NULL at the constructor so that the caller is guaranteed to know that something's wrong if init wasn't called. And state (doxygen for potentially harmful member functions) that it will return NULL if init is not called.

Thanks for your insights. I have added initializers on the constructor and added relevant doc for the member functions.

myungjoo

Other than ClBufferManager's constructor/initializer timing, LGTM.

Adding abstraction in cl_buffer_manager using const for data size Signed-off-by: Debadri Samaddar <[email protected]>

s-debadri requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, gichan-jang, anyj0527, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, skykongkong8, djeong20 and EunjuYang as code owners December 4, 2024 10:09

github-actions bot added the Need Review label Dec 4, 2024

[GPU] Global Buffer manager and optimization

3d44c5f

Implemented global Buffers Optimized pipeline due to reduced buffer creation steps Modifed command queue and Buffer wrappers Signed-off-by: Debadri Samaddar <[email protected]>

s-debadri force-pushed the global_buffer_feature branch from d9afede to 3d44c5f Compare December 4, 2024 10:11

[GPU] Lazy initialization of Buffers

dab2d91

Initialize buffer objects after command queue creation Signed-off-by: Debadri Samaddar <[email protected]>

EunjuYang reviewed Dec 9, 2024

View reviewed changes

s-debadri force-pushed the global_buffer_feature branch 2 times, most recently from 2031515 to 3be62f0 Compare December 10, 2024 05:25

jijoongmoon approved these changes Dec 11, 2024

View reviewed changes

EunjuYang approved these changes Dec 18, 2024

View reviewed changes

myungjoo reviewed Dec 20, 2024

View reviewed changes

[GPU] Abstraction of cl_buffer_manager

2633e0f

Adding abstraction in cl_buffer_manager using const for data size Signed-off-by: Debadri Samaddar <[email protected]>

s-debadri force-pushed the global_buffer_feature branch from 3be62f0 to 2633e0f Compare December 23, 2024 04:52

myungjoo approved these changes Dec 23, 2024

View reviewed changes

github-actions bot added PR/READY2MERGE and removed Need Review labels Dec 23, 2024

myungjoo merged commit 3ac9404 into nnstreamer:main Dec 23, 2024
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GPU] Global Buffer manager and optimization #2816

[GPU] Global Buffer manager and optimization #2816

s-debadri commented Dec 4, 2024 •

edited

Loading

EunjuYang left a comment

EunjuYang Dec 9, 2024

EunjuYang Dec 9, 2024

EunjuYang Dec 9, 2024

s-debadri Dec 10, 2024

EunjuYang Dec 9, 2024 •

edited

Loading

s-debadri Dec 10, 2024

EunjuYang Dec 10, 2024

s-debadri Dec 10, 2024

jijoongmoon left a comment

EunjuYang left a comment

s-debadri commented Dec 18, 2024

myungjoo Dec 20, 2024

s-debadri Dec 20, 2024

myungjoo Dec 21, 2024

s-debadri Dec 23, 2024

myungjoo left a comment

	size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
	const size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);

[GPU] Global Buffer manager and optimization #2816

[GPU] Global Buffer manager and optimization #2816

Conversation

s-debadri commented Dec 4, 2024 • edited Loading

EunjuYang left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

EunjuYang Dec 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jijoongmoon left a comment

Choose a reason for hiding this comment

EunjuYang left a comment

Choose a reason for hiding this comment

s-debadri commented Dec 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

myungjoo left a comment

Choose a reason for hiding this comment

s-debadri commented Dec 4, 2024 •

edited

Loading

EunjuYang Dec 9, 2024 •

edited

Loading