Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GPU] Global Buffer manager and optimization #2816

Merged
merged 3 commits into from
Dec 23, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 42 additions & 0 deletions nntrainer/cl_buffer_manager.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
// SPDX-License-Identifier: Apache-2.0
/**
* Copyright (C) 2024 Debadri Samaddar <[email protected]>
*
* @file cl_buffer_manager.cpp
* @date 01 Dec 2024
* @see https://github.com/nnstreamer/nntrainer
* @author Debadri Samaddar <[email protected]>
* @bug No known bugs except for NYI items
* @brief This file contains global Buffer objects and manages them
*/

#include <cl_buffer_manager.h>

namespace nntrainer {

ClBufferManager &ClBufferManager::getInstance() {
static ClBufferManager instance;
return instance;
}

// to-do: Implementation to be updated with array of Buffer objects if required
// fp16 Buffer objects to be added in future
void ClBufferManager::initBuffers() {
readBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
readBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
readBufferC = new opencl::Buffer(context_inst_, buffer_size_bytes, true);
writeBufferA = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
writeBufferB = new opencl::Buffer(context_inst_, buffer_size_bytes, false);
ml_logi("ClBufferManager: Buffers initialized");
}
Copy link
Contributor

@EunjuYang EunjuYang Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple question! Based on this PR, do we have to create all buffers for every kernel in advance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer creation consumes around 70-75% of the whole latency. This PR will create buffers only once at the beginning. All kernels will be able to re-use same or different buffers multiple times. Which means, buffer data update can happen multiple times but creation will happen only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood the concept. The point I want to ask is how may buffers should be created in advance. Also, as you mentioned, the buffer can be reused. Then, the manager should schedule the proper buffer by hiding the internal buffer assets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the PR, 5 buffers are created in advance each of 256 MiB. Also, as you suggested before I have added proper abstraction for cl_buffer_manager.


ClBufferManager::~ClBufferManager() {
delete readBufferA;
delete readBufferB;
delete readBufferC;
delete writeBufferA;
delete writeBufferB;
ml_logi("ClBufferManager: Buffers destroyed");
}

} // namespace nntrainer
77 changes: 77 additions & 0 deletions nntrainer/cl_buffer_manager.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
// SPDX-License-Identifier: Apache-2.0
/**
* Copyright (C) 2024 Debadri Samaddar <[email protected]>
*
* @file cl_buffer_manager.h
* @date 01 Dec 2024
* @see https://github.com/nnstreamer/nntrainer
* @author Debadri Samaddar <[email protected]>
* @bug No known bugs except for NYI items
* @brief This file contains global Buffer objects and manages them
*/

#ifndef __CL_BUFFER_MANAGER_H__
#define __CL_BUFFER_MANAGER_H__

#include <string>

#include <opencl_buffer.h>
#include <opencl_context_manager.h>

#include <nntrainer_log.h>

namespace nntrainer {

/**
* @class ClBufferManager contains Buffer object management
* @brief Support for Buffer management
*/

class ClBufferManager {

private:
/**
* @brief Private constructor to prevent object creation
*
*/
ClBufferManager(){};

/**
* @brief OpenCl context global instance
*
*/
opencl::ContextManager &context_inst_ = opencl::ContextManager::GetInstance();

/**
* @brief Buffer size in bytes preset (256 mebibytes)
*/
size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a fixed constant, what about adding const ?

Suggested change
size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);
const size_t buffer_size_bytes = 8192 * 8192 * sizeof(float);


public:
/**
* @brief Get Global ClBufferManager.
*
* @return ClBufferManager&
*/
static ClBufferManager &getInstance();

opencl::Buffer *readBufferA;
opencl::Buffer *readBufferB;
opencl::Buffer *readBufferC;
opencl::Buffer *writeBufferA;
opencl::Buffer *writeBufferB;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it reasonable to let them public? From the name of this class, cl_buffer_manager, it would be better not to expose the right managing the buffers outside. Isn't it better to make them private and implement some methods to access them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more comment.
It sounds confusing (read / write - only holds from GPU perspective). What about changing kernelInBuffer / OutBuffer? I'm not sure my suggestion is the best option... i find we can find better names making them more claer :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added abstraction in cl_buffer_manager in the latest commit.


/**
* @brief Initialize Buffer objects.
*/
void initBuffers();

/**
* @brief Destroy Buffer pointers.
*
*/
~ClBufferManager();
};
} // namespace nntrainer

#endif /* __CL_BUFFER_MANAGER_H__ */
8 changes: 7 additions & 1 deletion nntrainer/cl_context.h
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@
#include <layer.h>
#include <layer_devel.h>

#include <cl_buffer_manager.h>
#include <opencl_command_queue_manager.h>
#include <opencl_context_manager.h>
#include <opencl_kernel.h>
Expand Down Expand Up @@ -79,12 +80,14 @@ class ClContext {

template <typename... Ts> using FactoryMap = std::tuple<IndexType<Ts>...>;

// getting static instance of commandqueue and opencl context
// getting static instance of commandqueue, opencl context and buffermanager
opencl::CommandQueueManager &command_queue_inst_ =
opencl::CommandQueueManager::GetInstance();

opencl::ContextManager &context_inst_ = opencl::ContextManager::GetInstance();

ClBufferManager &clbuffInstance = ClBufferManager::getInstance();

/**
* @brief Default constructor
*/
Expand Down Expand Up @@ -272,6 +275,9 @@ class ClContext {

// getContext() called inside createCommandQueue which creates clContext
bool result = command_queue_inst_.CreateCommandQueue();
// initialize device buffers
clbuffInstance.initBuffers();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you gurantee that initBuffers() is called before any possible call to ClBufferManager::some-access-funtion?
Is there a reason not to do it at the constructor of ClBufferManager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes initBuffers() will be called at the beginning when context is initialized before any ClBufferManager member is accessed.
The main reason of not doing it inside the constructor is to enforce lazy loading of the buffers. Otherwise buffers might get initialized before OpenCL context and command queue is initialized which might result in undefined behaviour when trying to read/write into the buffers. To avoid such ambiguity, it was removed from the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee the calling sequence (other memer functions are called after initBuffers()), it should not depend on human programmer's plans, comments, documents, or promises. It is still prone to human errors (especially with open source contributors). It should be guaranteed by the code so that another developer who doesn't know about this in the open source community won't be able to reverse the sequence by mistake, thinking that this object is initialized already with getInstance or constructor.

At least, set the buffer pointers NULL at the constructor so that the caller is guaranteed to know that something's wrong if init wasn't called. And state (doxygen for potentially harmful member functions) that it will return NULL if init is not called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your insights. I have added initializers on the constructor and added relevant doc for the member functions.


cl_initialized = result;
return cl_initialized;
};
Expand Down
2 changes: 2 additions & 0 deletions nntrainer/meson.build
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,9 @@ nntrainer_common_sources = [

if get_option('enable-opencl')
nntrainer_headers += meson.current_source_dir() / 'cl_context.h'
nntrainer_headers += meson.current_source_dir() / 'cl_buffer_manager.h'
nntrainer_common_sources += 'cl_context.cpp'
nntrainer_common_sources += 'cl_buffer_manager.cpp'
endif

foreach s : nntrainer_common_sources
Expand Down
30 changes: 29 additions & 1 deletion nntrainer/opencl/opencl_buffer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ namespace nntrainer::opencl {
* @param read_only flag
* @param data data for the buffer
*/
Buffer::Buffer(ContextManager &context_manager, int size_in_bytes,
Buffer::Buffer(ContextManager &context_manager, size_t size_in_bytes,
bool read_only, void *data) {
cl_context context = context_manager.GetContext();
cl_mem_flags flags = read_only ? CL_MEM_READ_ONLY : CL_MEM_READ_WRITE;
Expand Down Expand Up @@ -94,6 +94,20 @@ bool Buffer::WriteData(CommandQueueManager &command_queue_inst,
return command_queue_inst.EnqueueWriteBuffer(mem_buf_, size_, data);
}

bool Buffer::WriteDataRegion(CommandQueueManager &command_queue_inst,
size_t size_in_bytes, const void *data,
size_t host_origin_offset,
size_t buffer_origin_offset) {
if (size_in_bytes > size_) {
ml_loge("Failed to write buffer region. Region size(%lu bytes) greater "
"than buffer size(%lu bytes).",
size_in_bytes, size_);
return false;
}
return command_queue_inst.EnqueueWriteBufferRegion(
mem_buf_, size_in_bytes, data, host_origin_offset, buffer_origin_offset);
}

/**
* @brief reading data from the buffer
*
Expand All @@ -105,6 +119,20 @@ bool Buffer::ReadData(CommandQueueManager &command_queue_inst, void *data) {
return command_queue_inst.EnqueueReadBuffer(mem_buf_, size_, data);
}

bool Buffer::ReadDataRegion(CommandQueueManager &command_queue_inst,
size_t size_in_bytes, void *data,
size_t host_origin_offset,
size_t buffer_origin_offset) {
if (size_in_bytes > size_) {
ml_loge("Failed to read from buffer region. Region size(%lu bytes) greater "
"than buffer size(%lu bytes).",
size_in_bytes, size_);
return false;
}
return command_queue_inst.EnqueueReadBufferRegion(
mem_buf_, size_in_bytes, data, host_origin_offset, buffer_origin_offset);
}

void *Buffer::MapBuffer(CommandQueueManager &command_queue_inst,
size_t offset_in_bytes, size_t size_in_bytes,
bool read_only, bool async) {
Expand Down
34 changes: 32 additions & 2 deletions nntrainer/opencl/opencl_buffer.h
Original file line number Diff line number Diff line change
Expand Up @@ -54,8 +54,8 @@ class Buffer {
* @param read_only flag
* @param data data for the buffer
*/
Buffer(ContextManager &context_manager, int size_in_bytes, bool read_only,
void *data);
Buffer(ContextManager &context_manager, size_t size_in_bytes, bool read_only,
void *data = nullptr);

/**
* @brief Move constructor for buffer by deleting the previous buffer
Expand Down Expand Up @@ -106,6 +106,21 @@ class Buffer {
*/
bool WriteData(CommandQueueManager &command_queue_inst, const void *data);

/**
* @brief writing data to a buffer region
*
* @param command_queue_inst reference of command queue instance
* @param size_in_bytes size of region
* @param data pointer of region
* @param host_origin_offset offset in the host memory region
* @param buffer_origin_offset offset in the buffer memory region
* @return true if successful write or false otherwise
*/
bool WriteDataRegion(CommandQueueManager &command_queue_inst,
size_t size_in_bytes, const void *data,
size_t host_origin_offset = 0,
size_t buffer_origin_offset = 0);

/**
* @brief reading data from the buffer
*
Expand All @@ -115,6 +130,21 @@ class Buffer {
*/
bool ReadData(CommandQueueManager &command_queue_inst, void *data);

/**
* @brief Reading data from a buffer region
*
* @param command_queue_inst reference of command queue instance
* @param size_in_bytes size of region
* @param data pointer of region
* @param host_origin_offset offset in the host memory region
* @param buffer_origin_offset offset in the buffer memory region
* @return true if successful write or false otherwise
*/
bool ReadDataRegion(CommandQueueManager &command_queue_inst,
size_t size_in_bytes, void *data,
size_t host_origin_offset = 0,
size_t buffer_origin_offset = 0);

/**
* @brief Mapping buffer to host memory
*
Expand Down
71 changes: 71 additions & 0 deletions nntrainer/opencl/opencl_command_queue_manager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,41 @@ bool CommandQueueManager::EnqueueReadBuffer(cl_mem buffer, size_t size_in_bytes,
return true;
}

bool CommandQueueManager::EnqueueReadBufferRegion(
cl_mem buffer, size_t size_in_bytes, void *data, size_t host_origin_offset,
size_t buffer_origin_offset, bool async) {

// managing synchronization
const cl_bool blocking = async ? CL_FALSE : CL_TRUE;

// (x, y, z) offset in the memory region associated with buffer
const size_t buffer_origin[] = {buffer_origin_offset, 0, 0};
// (x, y, z) offset in the memory region associated with host
const size_t host_origin[] = {host_origin_offset, 0, 0};
// region defines the (width in bytes, height in rows, depth in slices)
const size_t region[] = {size_in_bytes, 1, 1};
// length of each row in bytes
size_t row_pitch = region[0];
// length of each 2D slice in bytes
size_t slice_pitch = region[0] * region[1];

// Buffer and host data are interpreted as 1D in this case
// hence row and slice pitch are same for both
cl_int error_code = clEnqueueReadBufferRect(
command_queue_, buffer, blocking, buffer_origin, host_origin, region,
row_pitch, slice_pitch, row_pitch, slice_pitch, data, 0, nullptr, nullptr);

if (error_code != CL_SUCCESS) {
ml_loge("Failed to write data region to GPU (clEnqueueWriteBufferRect). "
"OpenCL error "
"code: %d",
error_code);
return false;
}

return true;
}

/**
* @brief Writing buffer object. Used from Buffer class
*
Expand All @@ -150,6 +185,7 @@ bool CommandQueueManager::EnqueueWriteBuffer(cl_mem buffer,
auto error_code =
clEnqueueWriteBuffer(command_queue_, buffer, blocking, 0, size_in_bytes,
data, 0, nullptr, nullptr);

if (error_code != CL_SUCCESS) {
ml_loge("Failed to upload data to GPU (clEnqueueWriteBuffer). OpenCL error "
"code: %d",
Expand All @@ -160,6 +196,41 @@ bool CommandQueueManager::EnqueueWriteBuffer(cl_mem buffer,
return true;
}

bool CommandQueueManager::EnqueueWriteBufferRegion(
cl_mem buffer, size_t size_in_bytes, const void *data,
size_t host_origin_offset, size_t buffer_origin_offset, bool async) {

// managing synchronization
const cl_bool blocking = async ? CL_FALSE : CL_TRUE;

// (x, y, z) offset in the memory region associated with buffer
const size_t buffer_origin[] = {buffer_origin_offset, 0, 0};
// (x, y, z) offset in the memory region associated with host
const size_t host_origin[] = {host_origin_offset, 0, 0};
// region defines the (width in bytes, height in rows, depth in slices)
const size_t region[] = {size_in_bytes, 1, 1};
// length of each row in bytes
size_t row_pitch = region[0];
// length of each 2D slice in bytes
size_t slice_pitch = region[0] * region[1];

// Buffer and host data are interpreted as 1D in this case
// hence row and slice pitch are same for both
cl_int error_code = clEnqueueWriteBufferRect(
command_queue_, buffer, blocking, buffer_origin, host_origin, region,
row_pitch, slice_pitch, row_pitch, slice_pitch, data, 0, nullptr, nullptr);

if (error_code != CL_SUCCESS) {
ml_loge("Failed to write data region to GPU (clEnqueueWriteBufferRect). "
"OpenCL error "
"code: %d",
error_code);
return false;
}

return true;
}

/**
* @brief Mapping a region of a buffer object into the host address space
*
Expand Down
Loading
Loading