[GPU] Global Buffer manager and optimization #2816
Implemented global Buffers. Optimized pipeline due to reduced buffer creation steps. Modified command queue and Buffer wrappers. Signed-off-by: Debadri Samaddar <[email protected]>
d9afede to 3d44c5f
Initialize buffer objects after command queue creation Signed-off-by: Debadri Samaddar <[email protected]>
This PR updates the code to create clBuffers in advance, which is related to #2812. Thank you for the hard work. Here are some opinions on this PR from my side:
nntrainer/cl_buffer_manager.h (outdated)
/**
 * @brief Buffer size in bytes preset (256 mebibytes)
 */
size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
If it's a fixed constant, what about adding const?
- size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
+ const size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
nntrainer/cl_buffer_manager.h (outdated)
opencl::Buffer *readBufferA;
opencl::Buffer *readBufferB;
opencl::Buffer *readBufferC;
opencl::Buffer *writeBufferA;
opencl::Buffer *writeBufferB;
Is it reasonable to make them public? Given the class name, cl_buffer_manager, it would be better not to expose buffer management to outside code. Wouldn't it be better to make them private and implement some methods to access them?
One more comment.
The names sound confusing (read / write only holds from the GPU's perspective). What about changing them to kernelInBuffer / OutBuffer? I'm not sure my suggestion is the best option... I think we can find better names that make them clearer :)
Added abstraction in cl_buffer_manager in the latest commit.
nntrainer/cl_buffer_manager.cpp (outdated)
void ClBufferManager::initBuffers() {
  readBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
  readBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
  readBufferC = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
  writeBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
  writeBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
  ml_logi("ClBufferManager: Buffers initialized");
}
Simple question! Based on this PR, do we have to create all buffers for every kernel in advance?
Buffer creation consumes around 70-75% of the whole latency. This PR creates the buffers only once, at the beginning. All kernels can then reuse the same or different buffers multiple times. In other words, buffer data can be updated many times, but creation happens only once.
I understand the concept. What I want to ask is how many buffers should be created in advance. Also, as you mentioned, the buffers can be reused. In that case, the manager should schedule the proper buffer while hiding the internal buffer assets.
As per the PR, 5 buffers are created in advance, each of 256 MiB. Also, as you suggested before, I have added proper abstraction for cl_buffer_manager.
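The create-once, update-many idea discussed above can be sketched without any OpenCL dependency. Everything here is an assumption made for illustration: `FakeBuffer` stands in for `opencl::Buffer`, and `BufferPool`, `inBufferA()`, and `outBufferA()` are hypothetical names rather than the PR's actual interface:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Counts how many stand-in buffers were allocated (illustration only).
int g_buffers_created = 0;

// Hypothetical stand-in for opencl::Buffer; real code would wrap a cl_mem.
class FakeBuffer {
public:
  explicit FakeBuffer(std::size_t bytes) : storage_(bytes) {
    ++g_buffers_created;
  }

  // Updates the contents without reallocating; fails if the data does not fit.
  bool write(const void *src, std::size_t bytes) {
    if (bytes > storage_.size()) return false;
    std::memcpy(storage_.data(), src, bytes);
    return true;
  }

private:
  std::vector<unsigned char> storage_;
};

// Buffers are private; callers only reach them through accessors.
class BufferPool {
public:
  // Allocates each buffer at most once, even if called repeatedly.
  void init(std::size_t bytes) {
    if (!in_a_) in_a_ = std::make_unique<FakeBuffer>(bytes);
    if (!out_a_) out_a_ = std::make_unique<FakeBuffer>(bytes);
  }
  FakeBuffer *inBufferA() { return in_a_.get(); }
  FakeBuffer *outBufferA() { return out_a_.get(); }

private:
  std::unique_ptr<FakeBuffer> in_a_;
  std::unique_ptr<FakeBuffer> out_a_;
};

// Simulates `runs` kernel launches that each refresh the input data,
// then reports how many buffers were allocated in total.
int allocationsAfterRuns(int runs) {
  g_buffers_created = 0;
  BufferPool pool;
  pool.init(1024);
  const float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  for (int i = 0; i < runs; ++i) {
    pool.inBufferA()->write(data, sizeof(data));
  }
  return g_buffers_created;
}
```

In this sketch the allocation count stays at two regardless of how many runs are simulated, which is the property the PR relies on to cut buffer-creation latency.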
2031515 to 3be62f0
LGTM
Overall updates look good :)
I have one question about the performance.
I ran unittest_blas_kernels_cl to check its speedup, and the result was interesting.
This PR is consistently slow (545 ms total on average).
The previous version was slow on the first run (754 ms) but averaged 243 ms over the remaining trials.
Is it because the buffer manager creates all buffers for every process in advance? I don't know why the previous version showed better performance except for the first run.
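Comparing a first run against the average of the remaining trials, as in the numbers above, can be done with a small timing harness. This is only a sketch under stated assumptions: `measureLatency` and `LatencyStats` are hypothetical names, and the callable stands in for whatever kernel invocation is being measured:

```cpp
#include <chrono>
#include <functional>

// First-run latency reported separately from the warm-run average.
struct LatencyStats {
  double first_ms;
  double rest_avg_ms;
};

// Runs `run` `trials` times and splits the timings into the first (cold)
// call and the average of all later (warm) calls.
LatencyStats measureLatency(const std::function<void()> &run, int trials) {
  using clock = std::chrono::steady_clock;
  LatencyStats stats{0.0, 0.0};
  double rest_total = 0.0;
  for (int i = 0; i < trials; ++i) {
    const auto start = clock::now();
    run();
    const std::chrono::duration<double, std::milli> elapsed =
      clock::now() - start;
    if (i == 0) {
      stats.first_ms = elapsed.count();
    } else {
      rest_total += elapsed.count();
    }
  }
  if (trials > 1) {
    stats.rest_avg_ms = rest_total / (trials - 1);
  }
  return stats;
}
```

Keeping the cold and warm figures apart makes one-time setup cost, such as up-front buffer creation, visible instead of folding it into a single average.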
This PR contains new buffer implementation with only For proper calculation of latency, you can see the difference if you run the same kernel (
@@ -272,6 +275,9 @@ class ClContext {

// getContext() called inside createCommandQueue which creates clContext
bool result = command_queue_inst_.CreateCommandQueue();
// initialize device buffers
clbuffInstance.initBuffers();
Can you guarantee that initBuffers() is called before any possible call to ClBufferManager::some-access-function?
Is there a reason not to do it in the constructor of ClBufferManager?
Yes, initBuffers() will be called at the beginning, when the context is initialized, before any ClBufferManager member is accessed.
The main reason for not doing it inside the constructor is to enforce lazy loading of the buffers. Otherwise, the buffers might get initialized before the OpenCL context and command queue are initialized, which might result in undefined behaviour when trying to read from or write to the buffers. To avoid such ambiguity, it was removed from the constructor.
To guarantee the calling sequence (other member functions are called after initBuffers()), it should not depend on a human programmer's plans, comments, documents, or promises. That is still prone to human error (especially with open source contributors). It should be guaranteed by the code, so that another developer in the open source community who doesn't know about this won't be able to reverse the sequence by mistake, thinking that this object is already initialized by getInstance or the constructor.
At least, set the buffer pointers NULL at the constructor so that the caller is guaranteed to know that something's wrong if init wasn't called. And state (doxygen for potentially harmful member functions) that it will return NULL if init is not called.
Thanks for your insights. I have added initializers on the constructor and added relevant doc for the member functions.
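The reviewer's suggestion (null pointers from the constructor, with accessors documented to return NULL before init) might look roughly like the following. This is a sketch, not the PR's code: `GuardedBufferManager`, `DeviceBuffer`, and `getInBuffer()` are hypothetical names, and the real class would hold `opencl::Buffer` objects:

```cpp
#include <cstddef>

// Hypothetical stand-in for opencl::Buffer.
struct DeviceBuffer {
  std::size_t bytes;
};

class GuardedBufferManager {
public:
  // The pointer starts out null so a missing initBuffers() call is detectable.
  GuardedBufferManager() : in_buffer_(nullptr) {}
  ~GuardedBufferManager() { delete in_buffer_; }

  // Owning raw pointer, so copying is disabled to avoid double deletion.
  GuardedBufferManager(const GuardedBufferManager &) = delete;
  GuardedBufferManager &operator=(const GuardedBufferManager &) = delete;

  // Safe to call more than once; the buffer is only allocated the first time.
  void initBuffers(std::size_t bytes) {
    if (in_buffer_ == nullptr) {
      in_buffer_ = new DeviceBuffer{bytes};
    }
  }

  /**
   * @brief Access the input buffer
   * @return nullptr if initBuffers() has not been called yet
   */
  DeviceBuffer *getInBuffer() const { return in_buffer_; }

private:
  DeviceBuffer *in_buffer_;
};
```

A caller that checks the returned pointer gets a clear failure signal instead of undefined behaviour when the initialization order is reversed by mistake.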
Other than ClBufferManager's constructor/initializer timing, LGTM.
Adding abstraction in cl_buffer_manager. Using const for data size. Signed-off-by: Debadri Samaddar <[email protected]>
3be62f0 to 2633e0f
Optimized GPU pipeline by managing Buffer creations globally:
- ClBufferManager for global Buffer objects.
- clEnqueueReadBufferRect and clEnqueueWriteBufferRect.
- WriteDataRegion and ReadDataRegion members of Buffer class to read and write to a particular region of device buffer.
- cl_context.
These changes optimize the current GPU pipeline by managing device buffers optimally.
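Writing to a particular region of a larger preallocated buffer comes down to origin and row-pitch arithmetic. The sketch below illustrates that addressing on the host side only, in two dimensions; it makes no OpenCL calls, and `writeRegion2D` and its parameters are hypothetical, not the signature of the PR's `WriteDataRegion`:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `rows` rows of `row_bytes` bytes from `src` into `dst`, starting at
// byte column `dst_x` of row `dst_y`. Pitches are row strides in bytes.
// Returns false, without writing, if the region would not fit.
bool writeRegion2D(std::vector<unsigned char> &dst, std::size_t dst_row_pitch,
                   std::size_t dst_x, std::size_t dst_y,
                   const unsigned char *src, std::size_t src_row_pitch,
                   std::size_t row_bytes, std::size_t rows) {
  if (rows == 0 || row_bytes == 0) return true; // nothing to copy
  if (dst_x + row_bytes > dst_row_pitch) return false; // row would wrap
  if (row_bytes > src_row_pitch) return false; // source rows would overlap
  if ((dst_y + rows) * dst_row_pitch > dst.size()) return false; // past end

  for (std::size_t r = 0; r < rows; ++r) {
    // Offset of a row start: row index * row pitch + column offset.
    unsigned char *to = dst.data() + (dst_y + r) * dst_row_pitch + dst_x;
    const unsigned char *from = src + r * src_row_pitch;
    std::memcpy(to, from, row_bytes);
  }
  return true;
}
```

For example, with a 4x4 byte buffer (row pitch 4), writing a 2x2 block at column 1, row 1 touches only offsets 5, 6, 9, and 10 and leaves the rest of the buffer unchanged.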
Self evaluation:
Signed-off-by: Debadri Samaddar [email protected]