hls4ml Optimization API [Part 1] #768

bo3z · 2023-04-16T09:08:45Z

This pull request introduces the first part of the hls4ml Optimization API - an automated workflow for hardware-aware model compression. By formulating pruning and weight sharing as a linear optimisation problem, the workflow iteratively selects redundant weights, considering the overall impact on hardware. The tool currently supports Keras and QKeras models, as well as various hardware objectives on GPUs (FLOPs) and FPGAs, with a Vivado hls4ml backend (DSP, BRAM, FF). However, the tool is both hardware- and framework-agnostic - most of the concepts readily generalise to other frameworks (e.g. PyTorch) and other hardware (e.g. Quartus backend). This allows end users to write custom objectives (e.g. Quartus latency optimisation), following a similar template.

Furthermore, this tool aims to bridge the gap between other libraries for model compression, such as TensorFlow Model Optimization and QKeras. The tool is directly integrated with QKeras and an updated version of Keras Surgeon, to aid model compression. Finally, this tool provides out-of-the-box support for structured pruning (filters, neurons), as well as gradient based ranking methods.

The exact implementation and motivations are further explained in the attached presentation. Initial results are shown on both classification and regression with various objectives including, sparsity, GPU FLOP reduction, Vivado DSPs and FFs. Since this is a large PR, it is recommended to review the commits one by one, as each commit is self-contained and can be checked out by itself. They are briefly explained below.

Supporting document and presentation

Available at: https://indico.cern.ch/event/1278049/

Type of change

New feature (non-breaking change which adds functionality)
A new research paper code implementation

Description

Contributions:

A new pattern pruning approach, inspired by the parallelism of FPGAs. For more information on the inspiration behind this approach, see the attached presentation.
Easy, out-of-the-box support for structured pruning and weight sharing.
Formulation of pruning as a Knapsack optimisation problem, relative to a hardware objective - maximise network performance while minimising some hardware resource(s)
Integration with Keras Surgeon and extended support for QKeras, to further reduce model footprint.
End-to-end flow for model optimisation, including an extensible & open-source library for hardware-aware and agnostic pruning of Keras & TensorFlow models.

Tests

Eight new unit tests were written in the PyTest framework. These tests are stored under hls4ml/test/pytest/optimization. Each test follows a single addition to the framework and are better explained by the individual commits.
Results on unstructured sparsity and Vivado resource estimate are recorded below. GPU FLOP optimisation will form a basis for a future study.
A full working example of how to use the tool is provided in the documentation folder, under the advanced section. Benchmark models, data sets and automated scripts will shortly be available in an additional repository, to be made public soon. Raw results of the synthesis, will be available on CERNBox.

Implementation Details

8ba4206 - introduces configuration files for the tool, as well as model attribute builder. A model attribute builder extracts layer information, and stores them in a framework-independent class. Depending on the objective, the attribute builder can take additional arguments, such as hls4ml config dictionary. All of these are stored in a generic class and used for selecting algorithm parameters
12fba05 - introduces three schedulers for sparsity - constant increment, polynomially decaying and binary halving, where the search space is iteratively halved until the optimal sparsity is found.
d95c956 - introduces utils for training Keras models, such as model gradients, back-propagation with weight freezing, calculating per-layer sparsity etc. Additionally, two new regularisers are added - one for Dense-based layers (Dense, QDense, but additionally works for Recurrent layers) and one for Conv2D layers (Conv2D, QConv2D). The regularisers can both penalise weight magnitude (pruning) or variance (weight sharing) at an arbitrary level -filter / neuron level, block or pattern.
de51797 - introduces various solvers (exact, greedy, MIP etc.) for the Knapsack problem, which is used to formulate model compression. Reasoning behind the formulation of pruning as a linear program (LP) is given in the attached document. By considering hardware utilisation as problem constraint and network performance as the objective function, it is possible to remove weights in a more informed way that unstructured pruning. Furthermore, this commit introduces the concept of objectives - a metric(s), such as hardware utilisation, latency or parameter count, the optimization problem should minimise.
a49a113 - introduces the logic behind selecting redundant (groups of weights), in an operation called masking. When ranking weights, both magnitude-based and gradient-based methods are possible. While gradient-based methods might produce better results, they are computationally expensive, so the choice is left for end users.
e655ab6 - introduces integration with Keras Surgeon, a library for removing structures (filters, neurons) from a Keras model and rewiring the graph. Keras Surgeon is no longer under active development, so it does not work with TensorFlow 2.3+. An updated version is stored in a forked repo on my GitHub, and a new addition to it is support for QKeras models: https://github.com/bo3z/keras-surgeon
399a98d - introduces a necessary pre-requisite for model pruning, the model builder, which adds a regularisation loss to every optimizable layer; to capture the loss of removing some of the (groups of) weights during training. The hyperparameters are automatically set using Keras Tuner.
47392ba - introduces the top-level function for Keras model compression and an objective for minimising GPU FLOPs.
a778e39 - introduces the wrapper function for compressing a Keras model, given a hls4ml config dictionary, making use of the above function. Furthermore, it introduces objective for minimising DSP, FFs and BRAM [WIP]
7cd25a0 - introduces documentation for the new tool, with a full working example.
Minor bug fixes and pre-commit f792ea6 and 82779ff

Results

Comparison with TensorFlow Model Optimization

The proposed is evaluated on a range of tasks including: jet classification, SVHN classification from the Fast CNNs paper and a Lenet-like model on Fashion MNIST classification. First, the developed library is compared with TFMOT, in terms of unstructured sparsity, across five trials. As seen, the two perform similarly, with hls4ml being significantly better on LeNet.

DSP-level pruning

Secondly, the method is evaluated on a range of reuse factors with strategy set to resource. These results are after full Vivado synthesis. Latency is reported from CoSim, not HLS estimate and it is in terms of clock cycles, reported as min and max. Where the model has been pruned, it was accelerated using "Unrolled Dense" #806. The baseline models are accelerated using the current version of master, 0.7 - 0.7.1. The decrease in latency is likely because unrolled dense uses the pipeline pragma, while standard resource uses dataflow. However, this is fine as pruning also reduces the number of LUT & FF. BM stands for baseline model, quantised to 16 bits (either <16, 6> or <16, 8>, depending on the accuracy) and BP-DSP stands for a model optimised for DSP utilisation, again quantised to 16 bits. BP-MO stands for multi-objective optimisation, targeting both BRAM and DSP utilisation.

First, DSP-level pruning is tested - the idea is to verify the effects of "pattern pruning" - pruning all the weights processed by the same DSP as RF varies. This is shown for jet tagging and SVHN, in both cases achieving significant reduction in DSP utilisation. Furthermore, due to the way hls4ml transposed and stores weights in BRAM, BRAM is also likely to reduce (the same way if pruning unstructured, some structures might be removed)

Multi-objective pruning

Next, verify multi-objective pruning - by pruning all the weights stored in the same BRAM (precision was set to 18 bits, due to the 36-bit width of BRAM), one can remove one block of RAM and two DSP for every pruned structured. Results are shown on jet tagging, since streaming CNNs overuse BRAM - however, in the next table, it is shown how this method can apply to LeNet, significantly reducing DSP utilisation and slightly reducing BRAM.

Heterogeneous multi-objective pruning for fast inference of LeNet

Consider accelerating a LeNet - in its simple form, it is too large to be accelerated fully unrolled, as the dense layers have ~48k and ~10k weights. Therefore, the design is pruned and accelerated heterogeneously - the Conv2D layers have a latency strategy and RF set to 1. The Dense layers have a Resource strategy - the first Dense layer uses a RF of 25 and the second on of 12. The output layer uses Latency strategy and RF = 1. The design is accelerated with <18, 8> precision. The effects of multi-objective pruning are shown in the table below. The algorithm will choose to prune some individual weights (a single DSP in Conv2D layers) and some groups of weights (a single BRAM block and 2 DSPs in Dense layers, depending on the solution of Knapsack problem).

Finally, it is shown how multi-objective pruning can be used to accelerate a general-purpose CNN for fast image classification on a medium-range accelerator card, ZCU102. The latency is reported in clock cycles, and the increase is likely due to the write out of the accelerator card.

Known limitations

This is the first part of the optimization API, introducing the software and ML-side of things. The second part will focus on hardware-specific implementations and improvements, including:

Code generation for Dense layers hls4ml Optimization API [Part 2] #809 uses exactly as many as DSP as required. However, there seems to be a small over-utilisation in LUTs / FFs, probably due to some fanout.
Sometimes (streaming CNNs) will overuse BRAM - still unclear why this is the case, as HLS estimates for BRAM are the same in unrolled and standard dense multiplication.
Lack of support for recurrent and Conv1D layers - however, the extensions follow a similar implementation to existing code, but haven't had the time to test it.

Checklist

I have read the guidelines for contributing.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
I have installed and run pre-commit on the files I edited or added.
I have added tests that prove my fix is effective or that my feature works.

bo3z · 2023-06-16T13:44:27Z

Pre-commit for this requires bit more work, as some of the lines are too long and there are also complains on constructor initialiser etc. I will get to it in the next few days. Otherwise ready for review. However, it turns out this doesn't trigger the new PyTests for the Optimization API? Is there a script that needs to be modified to include the newly added files.

docs/advanced/model_optimization.rst

jmitrevs · 2023-07-13T23:46:32Z

This is largely standalone, so I don't have many issues to not commit this. I see the following things left to do:

fix github warnings
Fix pre-commit issues
Make sure all the tests are being run. (I don't understand why they wouldn't be.)

hls4ml/optimization/keras/reduction.py

…o hls4ml-optimization-api-part-1

bo3z · 2023-12-04T14:32:18Z

The failing tests have been resolved and this can now be merged. The issue was due to Keras Surgeon - it uses an ancient version of PyTest. As such I have ignored the Keras Sureon test and removed it as a hard dependency but left very clear instructions to anyone wanting to use it on how to install it from GitHub. Anyway, the current (patched) Keras Surgeon is part of the FastML organisation, and, if it turns out there is interest in using it, it can later be fixed to solve the dependency issues.

Alongside #809, both branches should be up-to-date with master. For compatibility and testing sake, there is a branch combining the two PRs into a single branch: https://github.com/fastmachinelearning/hls4ml/tree/hardware-aware-pruning

As a side note, the paper describing the pruning algorithm is on arxiv: https://arxiv.org/abs/2308.05170. However, this is the pre-print version. I will include a link to the IEEE proceedings from FPT (held next week), once available. We can then include it the citation to the README.

jmitrevs · 2023-12-04T15:30:09Z

I want to run the pytests after the latest force-pushing but am having trouble triggering it.

jmitrevs · 2024-01-08T17:20:26Z

I am back from vacation. Are there any reasons not to merge this PR?

Cleanup docstrings

Opt1

bo3z requested a review from vloncar April 16, 2023 09:08

bo3z self-assigned this Apr 16, 2023

bo3z added the enhancement label Apr 16, 2023

jmduarte added the please test Trigger testing by creating local PR branch label Apr 17, 2023

bo3z force-pushed the hls4ml-optimization-api-part-1 branch from aa3b051 to 523061d Compare April 18, 2023 09:03

bo3z mentioned this pull request May 22, 2023

Propagate zeros from Conv layers to multiplication config #797

Merged

7 tasks

bo3z added 10 commits June 13, 2023 11:53

Optimization API config files & model attribute builder

8ba4206

Optimization sparsity schedulers

12fba05

Utils & regularizers for Keras optimization

d95c956

Knapsack solver & unstructured pruning objective

de51797

Keras optimization masking and weight removal logic

a49a113

Remove unused channels w/ Keras Surgeon

e655ab6

Hyperparameter tuning for pruning & weight sharing regularization

399a98d

Top-level Keras optimization function & GPU FLOPs optimization

47392ba

hls4ml objectives & top-level optimization function

a778e39

Add docs for hls4ml Optimization API

7cd25a0

bo3z mentioned this pull request Jun 13, 2023

hls4ml Optimization API [Part 2] #809

Merged

9 tasks

jmduarte added this to the v0.8.0 milestone Jun 15, 2023

bo3z added 2 commits June 16, 2023 11:51

Full support for multi-objective Vivado optimisation

f792ea6

part of pre-commit of Optimization API pt.1

82779ff

bo3z force-pushed the hls4ml-optimization-api-part-1 branch from 523061d to 82779ff Compare June 16, 2023 13:42

jmitrevs reviewed Jul 13, 2023

View reviewed changes

docs/advanced/model_optimization.rst Show resolved Hide resolved

jmduarte reviewed Jul 14, 2023

View reviewed changes

hls4ml/optimization/keras/reduction.py Outdated Show resolved Hide resolved

Fix missing packages; rename PyTests & pre-commit pt 2.

aac3f1f

bo3z force-pushed the hls4ml-optimization-api-part-1 branch from 3bcde15 to aac3f1f Compare September 15, 2023 17:13

Merge branch 'main' into hls4ml-optimization-api-part-1

5b4488e

bo3z force-pushed the hls4ml-optimization-api-part-1 branch 2 times, most recently from 2cf919b to a029d2e Compare September 18, 2023 11:56

jmitrevs added this to the v1.0.0 milestone Oct 20, 2023

bo3z and others added 2 commits December 4, 2023 13:05

Merge branch 'master' into hls4ml-optimization-api-part-1

4aff443

Merge branch 'fastmachinelearning:hls4ml-optimization-api-part-1' int…

1156ff9

…o hls4ml-optimization-api-part-1

vloncar added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Dec 4, 2023

bo3z force-pushed the hls4ml-optimization-api-part-1 branch 2 times, most recently from 14e987f to 9932f36 Compare December 4, 2023 13:47

Fix optimization failing PyTests

c4a5a0f

bo3z force-pushed the hls4ml-optimization-api-part-1 branch from 9932f36 to c4a5a0f Compare December 4, 2023 14:09

jmitrevs added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Dec 4, 2023

jmitrevs added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Jan 8, 2024

Cleanup docstrings

7a26a9a

vloncar mentioned this pull request Jan 25, 2024

Cleanup docstrings bo3z/hls4ml#3

Merged

Merge pull request #3 from vloncar/opt1

ad47f41

Cleanup docstrings

jmitrevs added please test Trigger testing by creating local PR branch and removed please test Trigger testing by creating local PR branch labels Jan 25, 2024

vloncar and others added 4 commits January 29, 2024 19:04

Fix docstring in ObjectiveEstimator

4675607

Add optimization API paper to reference.rst

8503c86

Rename optimize_keras_for_hls4ml

6eb391f

Merge pull request #4 from vloncar/opt1

e66d7e7

Opt1

vloncar approved these changes Jan 29, 2024

View reviewed changes

vloncar merged commit 33f8ffb into fastmachinelearning:main Jan 29, 2024
4 checks passed

calad0i mentioned this pull request May 14, 2024

Unrolled implementation for latency dense/conv layers #1014

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hls4ml Optimization API [Part 1] #768

hls4ml Optimization API [Part 1] #768

bo3z commented Apr 16, 2023 •

edited

Loading

bo3z commented Jun 16, 2023 •

edited

Loading

jmitrevs commented Jul 13, 2023

bo3z commented Dec 4, 2023

jmitrevs commented Dec 4, 2023

jmitrevs commented Jan 8, 2024

hls4ml Optimization API [Part 1] #768

hls4ml Optimization API [Part 1] #768

Conversation

bo3z commented Apr 16, 2023 • edited Loading

Supporting document and presentation

Type of change

Description

Tests

Implementation Details

Results

Comparison with TensorFlow Model Optimization

DSP-level pruning

Multi-objective pruning

Heterogeneous multi-objective pruning for fast inference of LeNet

Known limitations

Checklist

bo3z commented Jun 16, 2023 • edited Loading

jmitrevs commented Jul 13, 2023

bo3z commented Dec 4, 2023

jmitrevs commented Dec 4, 2023

jmitrevs commented Jan 8, 2024

bo3z commented Apr 16, 2023 •

edited

Loading

bo3z commented Jun 16, 2023 •

edited

Loading