Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GPU] Global Buffer manager and optimization #2816

Merged
merged 3 commits into from
Dec 23, 2024

Conversation

s-debadri
Copy link
Contributor

@s-debadri s-debadri commented Dec 4, 2024

Optimized GPU pipeline by managing Buffer creations globally:

  • Added singleton class ClBufferManager for global Buffer objects.
  • Created 5 buffers of 256 mebibytes globally which can be used by all the kernels multiple times.
  • Added wrappers for OpenCL APIs clEnqueueReadBufferRect and clEnqueueWriteBufferRect.
  • Implemented WriteDataRegion and ReadDataRegion members of Buffer class to read and write to a particular region of device buffer.
  • Lazy initialization of buffers at cl_context.
  • Tested with BLAS kernels.

These changes optimizes the current GPU pipeline by managing device buffers optimally.

Note: Device buffer preset size can be modified later if required.

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: Debadri Samaddar [email protected]

Implemented global Buffers
Optimized pipeline due to reduced buffer creation steps
Modifed command queue and Buffer wrappers

Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri s-debadri force-pushed the global_buffer_feature branch from d9afede to 3d44c5f Compare December 4, 2024 10:11
Initialize buffer objects after command queue creation

Signed-off-by: Debadri Samaddar <[email protected]>
Copy link
Contributor

@EunjuYang EunjuYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR updates to create clBuffers in advance, which is related to #2812. Thank you for the hard work. Please check some opinions on this PR from my side:

/**
* @brief Buffer size in bytes preset (256 mebibytes)
*/
size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a fixed constant, what about adding const ?

Suggested change
size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
const size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);

Comment on lines 58 to 62
opencl::Buffer *readBufferA;
opencl::Buffer *readBufferB;
opencl::Buffer *readBufferC;
opencl::Buffer *writeBufferA;
opencl::Buffer *writeBufferB;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it reasonable to let them public? From the name of this class, cl_buffer_manager, it would be better not to expose the right managing the buffers outside. Isn't it better to make them private and implement some methods to access them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment.
It sounds confusing (read / write - only holds from GPU perspective). What about changing kernelInBuffer / OutBuffer? I'm not sure my suggestion is the best option... i find we can find better names making them more claer :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added abstraction in cl_buffer_manager in the latest commit.

Comment on lines 24 to 31
void ClBufferManager::initBuffers() {
readBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
readBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
readBufferC = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
writeBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
writeBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
ml_logi("ClBufferManager: Buffers initialized");
}
Copy link
Contributor

@EunjuYang EunjuYang Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple question! Based on this PR, do we have to create all buffers for every kernel in advance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer creation consumes around 70-75% of the whole latency. This PR will create buffers only once at the beginning. All kernels will be able to re-use same or different buffers multiple times. Which means, buffer data update can happen multiple times but creation will happen only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood the concept. The point I want to ask is how may buffers should be created in advance. Also, as you mentioned, the buffer can be reused. Then, the manager should schedule the proper buffer by hiding the internal buffer assets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the PR, 5 buffers are created in advance each of 256 MiB. Also, as you suggested before I have added proper abstraction for cl_buffer_manager.

@s-debadri s-debadri force-pushed the global_buffer_feature branch 2 times, most recently from 2031515 to 3be62f0 Compare December 10, 2024 05:25
Copy link
Collaborator

@jijoongmoon jijoongmoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@EunjuYang EunjuYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall updates look good :)

I have one question about the performance.
I run the unittest_blas_kernels_cl to check its speedup; the result was interesting.

This PR shows regularly slow (545 ms total in avg.).
The previous version showed slow for the first run (754 ms) and (243 ms in avg.) for remaining trials.
Is it because the buffer manager create all buffers for every process in advance? I don't know why the previous version showed better performance except for the first run.

@s-debadri
Copy link
Contributor Author

Overall updates look good :)

I have one question about the performance. I run the unittest_blas_kernels_cl to check its speedup; the result was interesting.

This PR shows regularly slow (545 ms total in avg.). The previous version showed slow for the first run (754 ms) and (243 ms in avg.) for remaining trials. Is it because the buffer manager create all buffers for every process in advance? I don't know why the previous version showed better performance except for the first run.

This PR contains new buffer implementation with only sgemv kernel as of now. So other unit tests were still creating additional buffers with this new PR. Which means extra buffers were being created with this new PR since all the BLAS kernels are not updated.

For proper calculation of latency, you can see the difference if you run the same kernel (sgemv for this PR) multiple times.

Note: This PR mainly contains the buffer manager and is tested with sgemv. New changes for other blas kernels will be done once this PR gets merged otherwise one PR will have too many commits.

@@ -272,6 +275,9 @@ class ClContext {

// getContext() called inside createCommandQueue which creates clContext
bool result = command_queue_inst_.CreateCommandQueue();
// initialize device buffers
clbuffInstance.initBuffers();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you gurantee that initBuffers() is called before any possible call to ClBufferManager::some-access-funtion?
Is there a reason not to do it at the constructor of ClBufferManager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes initBuffers() will be called at the beginning when context is initialized before any ClBufferManager member is accessed.
The main reason of not doing it inside the constructor is to enforce lazy loading of the buffers. Otherwise buffers might get initialized before OpenCL context and command queue is initialized which might result in undefined behaviour when trying to read/write into the buffers. To avoid such ambiguity, it was removed from the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee the calling sequence (other memer functions are called after initBuffers()), it should not depend on human programmer's plans, comments, documents, or promises. It is still prone to human errors (especially with open source contributors). It should be guaranteed by the code so that another developer who doesn't know about this in the open source community won't be able to reverse the sequence by mistake, thinking that this object is initialized already with getInstance or constructor.

At least, set the buffer pointers NULL at the constructor so that the caller is guaranteed to know that something's wrong if init wasn't called. And state (doxygen for potentially harmful member functions) that it will return NULL if init is not called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your insights. I have added initializers on the constructor and added relevant doc for the member functions.

Copy link
Member

@myungjoo myungjoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than ClBufferManager's constructor/initializer timing, LGTM.

Adding abstraction in cl_buffer_manager
using const for data size

Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri s-debadri force-pushed the global_buffer_feature branch from 3be62f0 to 2633e0f Compare December 23, 2024 04:52
@myungjoo myungjoo merged commit 3ac9404 into nnstreamer:main Dec 23, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants