Explicit pointwise Conv1D implementation for "Latency" strategy #811
pre-commit.ci autofix
@jmduarte I'm actually trying this out now, but I just realized it targets Vivado. Is it possible to update this to Vitis? I would be happy to contribute if you want!
pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[0], res_tmp[0], weights, biases);
pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[1], res_tmp[1], weights, biases);
if (CONFIG_T::reuse_factor > 2)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[2], res_tmp[2], weights, biases);
if (CONFIG_T::reuse_factor > 3)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[3], res_tmp[3], weights, biases);
if (CONFIG_T::reuse_factor > 4)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[4], res_tmp[4], weights, biases);
if (CONFIG_T::reuse_factor > 5)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[5], res_tmp[5], weights, biases);
if (CONFIG_T::reuse_factor > 6)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[6], res_tmp[6], weights, biases);
if (CONFIG_T::reuse_factor > 7)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[7], res_tmp[7], weights, biases);
if (CONFIG_T::reuse_factor > 8)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[8], res_tmp[8], weights, biases);
if (CONFIG_T::reuse_factor > 9)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[9], res_tmp[9], weights, biases);
if (CONFIG_T::reuse_factor > 10)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[10], res_tmp[10], weights, biases);
if (CONFIG_T::reuse_factor > 11)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[11], res_tmp[11], weights, biases);
if (CONFIG_T::reuse_factor > 12)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[12], res_tmp[12], weights, biases);
if (CONFIG_T::reuse_factor > 13)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[13], res_tmp[13], weights, biases);
if (CONFIG_T::reuse_factor > 14)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[14], res_tmp[14], weights, biases);
if (CONFIG_T::reuse_factor > 15)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[15], res_tmp[15], weights, biases);
if (CONFIG_T::reuse_factor > 16)
I'm wondering if there is a better way to do this ...
Yes, I think we can use the code-generation machinery like this:
hls4ml/hls4ml/backends/fpga/passes/codegen.py
Lines 6 to 51 in abaea98
class GenerateConvIm2col(OptimizerPass):
    '''Generates code for im2col step of 1D/2D convolution'''

    def match(self, node):
        return isinstance(node, (Conv1D, Conv2D)) and node.model.config.get_config_value('IOType') == 'io_parallel'

    def transform(self, model, node):
        node_class = node.__class__.__name__
        if '1D' in node_class:
            self._generate_im2col_1d(node)
        elif '2D' in node_class:
            self._generate_im2col_2d(node)
        else:
            raise Exception(f'Cannot generate instructions for node {node.name} ({node_class})')

    def _generate_im2col_1d(self, node):
        code_str = node.model.config.backend.generate_conv1d_line_buffer_fn(
            node.get_attr('index'),
            node.get_attr('n_partitions'),
            node.get_input_variable().shape[0],
            node.get_input_variable().shape[1],
            kernel=node.get_attr('filt_width'),
            stride=node.get_attr('stride_width'),
            pad=(node.get_attr('pad_left'), node.get_attr('pad_right')),
        )

        node.set_attr('line_buffer_codegen', Source(code_str))

    def _generate_im2col_2d(self, node):
        code_str = node.model.config.backend.generate_conv2d_line_buffer_fn(
            node.get_attr('index'),
            node.get_attr('n_partitions'),
            node.get_input_variable().shape[0],
            node.get_input_variable().shape[1],
            node.get_input_variable().shape[2],
            kernel=(node.get_attr('filt_height'), node.get_attr('filt_width')),
            stride=(node.get_attr('stride_height'), node.get_attr('stride_width')),
            pad=(
                node.get_attr('pad_top'),
                node.get_attr('pad_bottom'),
                node.get_attr('pad_left'),
                node.get_attr('pad_right'),
            ),
        )

        node.set_attr('line_buffer_codegen', Source(code_str))
hls4ml/hls4ml/backends/fpga/fpga_backend.py
Lines 671 to 731 in abaea98
def generate_conv1d_line_buffer_fn(self, layer_idx, n_partitions, in_W, in_C, kernel=3, stride=1, pad=0, dilation=1):
    """Generate a C++ function that mimics the im2col algorithm. This function works for 1D convolution.

    The HLS compiler produces suboptimal designs for an im2col algorithm implementation, so a trick we use is
    to generate the result of the im2col transformation explicitly, instead of relying on loops. Since
    the result depends on the parameters of the convolution layer (the input size, the kernel size, stride etc),
    we need to do this for every convolution layer.

    Args:
        layer_idx (int): Index of layer ('index' attribute).
        n_partitions (int): Number of partitions to divide the input into.
            The pixels in each partition will be processed in parallel.
        in_W (int): Width of input.
        in_C (int): Number of channels.
        kernel (int, optional): Size of the kernel. Defaults to 3.
        stride (int, optional): Stride length. Defaults to 1.
        pad (int or Iterable, optional): Padding to apply. Defaults to 0.
            Specified as either a number or a list [left_pad, right_pad].
        dilation (int, optional): Dilation rate. Defaults to 1.

    Returns:
        str: Generated C++ function
    """
    if isinstance(pad, Iterable):
        pad_left = pad[0]
        pad_right = pad[1]
    else:
        pad_left = pad
        pad_right = pad

    im2col_matrix = self._compute_conv1d_im2col((in_W, in_C), kernel, stride, (pad_left, pad_right), dilation)

    generated_code = (
        "template<class data_T, typename CONFIG_T>\n"
        "class fill_buffer_{index} : public FillConv1DBuffer<data_T, CONFIG_T> {{\n"
        " public:\n"
        " static void fill_buffer(\n"
        " data_T data[CONFIG_T::in_width * CONFIG_T::n_chan],\n"
        " data_T buffer[CONFIG_T::n_pixels][CONFIG_T::filt_width * CONFIG_T::n_chan],\n"
        " const unsigned partition\n"
        " ) {{\n"
    ).format(index=layer_idx)
    indent = ' '

    for partition_idx, partition in enumerate(np.split(im2col_matrix, n_partitions)):
        generated_code += indent * 2 + f'if (partition == {partition_idx:>3}) {{\n'
        for pixel_idx, arr in enumerate(partition):
            buffer_stmts = []
            for j, v in enumerate(arr):
                if v == 0:
                    val = '0'
                else:
                    val = f'data[{int(v - 1)}]'
                buffer_stmts.append(f'buffer[{pixel_idx}][{j}] = {val:>10};')
            generated_code += indent * 3 + ' '.join(buffer_stmts) + '\n'
        generated_code += '\n' + indent * 2 + '}\n'

    generated_code += indent + '}\n'
    generated_code += '};\n'

    return generated_code
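For illustration, the chain of `if (CONFIG_T::reuse_factor > N)` calls in the snippet under review could in principle be emitted by the same kind of generator. The sketch below is only a guess at the shape of such a helper, not the code in #881: the function name `generate_pointwise_calls` and the leading comment it emits are hypothetical, and only `pointwise_conv_1d_latency_cl`, `data_tmp`, `res_tmp`, `weights`, and `biases` are taken from the reviewed snippet.

```python
def generate_pointwise_calls(layer_idx: int, reuse_factor: int) -> str:
    """Emit one pointwise_conv_1d_latency_cl call per reuse-factor slice.

    Hypothetical helper written in the style of generate_conv1d_line_buffer_fn;
    it is not part of hls4ml.
    """
    if reuse_factor < 1:
        raise ValueError('reuse_factor must be a positive integer')

    indent = '    '
    # Header comment so the generated C++ records which layer and RF it targets.
    lines = [f'// Generated for layer {layer_idx}, RF = {reuse_factor}']
    for i in range(reuse_factor):
        # One explicit call per slice, so no run-time `if` on reuse_factor is needed.
        lines.append(
            f'{indent}pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>'
            f'(data_tmp[{i}], res_tmp[{i}], weights, biases);'
        )
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    print(generate_pointwise_calls(layer_idx=2, reuse_factor=4))
```

Because the reuse factor is known when the Python pass runs, the generator can emit exactly RF calls instead of a fixed-length chain of guarded calls.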
@vloncar @Duchstf I started this branch to use the code generation machinery: jmduarte#20
Direct diff w.r.t. main: main...jmduarte:split_pointwise_conv_by_rf_codegen
#881
Is this better than the current approach?
Superseded by #881
Pointwise Conv1D with code generation for "Latency" strategy (update of #811)
Description
This is mostly for discussion and to let others test it out like @Duchstf. This PR adds an explicit pointwise Conv1D implementation, where the reuse factor (RF) is used to split the layer execution and reuse the existing module RF times.
Original pointwise Conv1D:
(in_width, n_chan) -> (in_width, n_filt)
This PR splits it into RF calls of:
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
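That a single call on the full input matches RF calls on slices can be checked with a small pure-Python sketch. It assumes the input is split into contiguous chunks along `in_width` and that the weights are laid out as `weights[chan][filt]`; both are assumptions made for illustration, and the function names are not from hls4ml.

```python
def pointwise_conv1d(data, weights, biases):
    """Pointwise (kernel size 1) Conv1D: (in_width, n_chan) -> (in_width, n_filt)."""
    n_filt = len(biases)
    return [
        [biases[f] + sum(x * weights[c][f] for c, x in enumerate(pixel)) for f in range(n_filt)]
        for pixel in data
    ]


def pointwise_conv1d_split(data, weights, biases, reuse_factor):
    """Same layer, run as RF calls on (in_width/RF, n_chan) slices."""
    in_width = len(data)
    if in_width % reuse_factor != 0:
        raise ValueError('in_width must be divisible by the reuse factor')
    step = in_width // reuse_factor
    result = []
    for i in range(reuse_factor):
        # Each call reuses the same function (the "existing module") on one slice.
        result.extend(pointwise_conv1d(data[i * step:(i + 1) * step], weights, biases))
    return result


# in_width = 4, n_chan = 2, n_filt = 3, RF = 2
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
weights = [[1, 0, -1], [2, 1, 0]]
biases = [0, 1, 2]

full = pointwise_conv1d(data, weights, biases)
split = pointwise_conv1d_split(data, weights, biases, reuse_factor=2)
print(full == split)
```

Each output position depends only on the input channels at that same position, which is why the layer can be cut along `in_width` without changing the result.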
The II ~ RF. To turn it on you have to configure ConvImplementation of the layer named <layer>
Limitations:
in_width is divisible by RF
RF = 120. Could be automated with code generation.
Type of change
Tests
See test/pytest/test_pointwiseconv.py
Checklist
pre-commit on the files I edited or added.