Project developed as part of the course Hardware Architectures for Embedded and Edge AI at Politecnico di Milano during the academic year 2022–2023. It involves the development of a Tiny Machine Learning (TinyML) application that aims to deploy a Convolutional Neural Network (CNN) on the Arduino Nano 33 BLE, a tiny resource-constrained device. The application focuses on creating a Hand Gesture Recognition (HGR) system, which is a Machine Learning task and a sub-field of Computer Vision. The CNN is developed using the Transfer Learning paradigm and MobileNetV1 as a base model. The system can classify gestures performed by an individual and trigger specific actions in real time using the device's built-in low-power camera.
The contents of the repository are organized as follows:
- `arduino`: contains the code and the library required to deploy the model on the device
- `models`: contains the trained models in a Keras-compatible format
- `notebooks`: contains the Jupyter Notebooks used for the Exploratory Data Analysis, dataset pre-processing, and quantization
- `reports`: contains the complete report of the project
- `weights`: contains the weights of MobileNetV1 used for Transfer Learning
The files `dataset.py`, `model.py`, and `train.py` contain helper functions and classes to manage the dataset, the model, and the training process, respectively. The training of the models is performed in the notebooks `all-gestures-model.ipynb`, `5-gestures-model.ipynb`, and `5-gestures-v2-model.ipynb`.
The target device is the Arduino Nano 33 BLE, part of the TinyML kit, which also includes a low-power OV7675 camera.
As with any other TinyML application, the main challenge of the project is optimizing the model so that it fits into the main memory of the device, which is limited to just 256 KB. This constraint concerns not only the model itself, but also the input data, the memory needed to store the intermediate feature maps of the network, and the libraries required for the device's functionality.
The dataset used in this project is the ASL dataset, which contains images of the American Sign Language alphabet. It is composed of 87,000 images, each 256x256 pixels and containing a single letter of the alphabet. The dataset is divided into 29 classes: the 26 letters of the alphabet plus three additional classes for space, delete, and nothing. Although not originally intended for hand gesture recognition, the different letters of the sign language can be treated as distinct gestures.
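As a hedged sketch of how such a dataset can be loaded with Keras, the snippet below uses `tf.keras.utils.image_dataset_from_directory`; the local path, target resolution, and validation split are assumptions, not values taken from the repository.

```python
# Hedged sketch: load the ASL Alphabet images with Keras.
# The directory path, target resolution, and split are assumptions.
import tensorflow as tf

IMG_SIZE = (96, 96)   # assumed resolution for the downstream MCU model
BATCH_SIZE = 32

train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_alphabet_train",       # hypothetical local path to the extracted dataset
    labels="inferred",
    label_mode="categorical",   # one-hot labels for categorical cross-entropy
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    validation_split=0.2,
    subset="both",              # returns the (train, validation) pair
    seed=42,
)

class_names = train_ds.class_names   # 29 classes: A-Z, space, del, nothing
print(f"{len(class_names)} classes:", class_names)
```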
In this section, the characteristics of the information contained in the dataset are analyzed. The goals of the EDA are the following:
- Check for inconsistencies in the data
- Check whether the class distribution is balanced
- Visualize a reduced-dimensional representation of the data
- Identify a subset of classes for model training
The data was found to be free of inconsistencies, and the class distribution was balanced. To select a subset of the gestures, feature extraction was performed on the images using the pre-trained CNN EfficientNetV2B1. Then, t-distributed Stochastic Neighbor Embedding (t-SNE) was applied to the extracted features to reduce the dimensionality and visualize the data. The following scatterplot shows the first two dimensions after t-SNE has been applied.
Unfortunately, the t-SNE plot does not show a clear separation between the classes, which makes it difficult to select a subset of gestures based on the visual representation. Therefore, the subset of gestures was decided based on the per-class error rate of a model trained on the entire dataset, which is called the all-gestures model.
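For reference, a hedged sketch of the feature-extraction and t-SNE step described above is shown below; the pooling choice, the perplexity, and the number of batches processed are assumptions, and the EDA notebook in `notebooks` remains the reference implementation. It reuses `train_ds` from the loading sketch above.

```python
# Hedged sketch of the feature-extraction + t-SNE step described above.
import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Pre-trained EfficientNetV2B1 used purely as a frozen feature extractor.
extractor = tf.keras.applications.EfficientNetV2B1(
    include_top=False, weights="imagenet", pooling="avg"
)

features, labels = [], []
for images, y in train_ds.take(50):          # a subset keeps t-SNE tractable
    features.append(extractor(images, training=False).numpy())
    labels.append(np.argmax(y.numpy(), axis=1))
features = np.concatenate(features)
labels = np.concatenate(labels)

# Reduce the extracted features to two dimensions and plot them.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab20")
plt.title("t-SNE of EfficientNetV2B1 features")
plt.show()
```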
The resulting model, the 5-gestures model, is trained on the subset of gestures 'C', 'G', 'H', 'Q', 'space', and 'nothing'. The feature extractor consists of MobileNetV1 with pre-trained weights, used as the base model for Transfer Learning.
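A minimal sketch of this transfer-learning architecture follows, assuming a 96x96 input resolution, a 0.25 width multiplier, and a simple Dropout + Dense classification head; none of these values are taken from the repository, and the actual definition lives in `model.py` and the training notebooks.

```python
# Hedged sketch of the transfer-learning architecture described above.
import tensorflow as tf

NUM_CLASSES = 6   # 'C', 'G', 'H', 'Q', 'space', 'nothing'

base = tf.keras.applications.MobileNet(
    input_shape=(96, 96, 3),   # assumed resolution; Keras falls back to the
    alpha=0.25,                # 224x224 ImageNet weights for this input size
    include_top=False,
    weights="imagenet",
    pooling="avg",
)
base.trainable = False         # frozen feature extractor for the first phase;
                               # unfrozen later for fine-tuning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),   # assumed rate
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```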
After training and fine-tuning the model using the Adam optimizer and the Categorical Cross-Entropy loss, it was evaluated on the test set. The following confusion matrix showcases its performance:
Quantization is required to deploy the model on the target device. This process reduces the inference time, the size of the model, and the average RAM usage, while the performance of the model remains approximately the same. Quantization was performed through Post Training Quantization (PTQ), which can be applied in multiple ways, as shown in the table below:
The second technique yields the highest performance. However, since the Microcontroller (MCU) of the target device supports only full-integer models, the third technique was selected. As expected, the quantized model performs slightly worse than the unoptimized one, but its accuracy remains very high.
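As an illustration, full-integer PTQ can be performed with the TensorFlow Lite converter roughly as in the sketch below; it reuses `model` and `train_ds` from the earlier sketches, and the number of calibration batches is an assumption.

```python
# Hedged sketch of full-integer Post Training Quantization (the third technique).
import tensorflow as tf

def representative_data_gen():
    # Feed a few hundred training images so the converter can calibrate
    # the activation ranges.
    for images, _ in train_ds.take(100):
        for image in images:
            yield [tf.expand_dims(image, axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer quantization: weights, activations, and I/O become int8,
# which is what the target MCU supports.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model) / 1024:.1f} KB")
```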
After deployment on the device, the model was tested in real time with the hands of different individuals, against different backgrounds, and under different lighting conditions. Its performance was much lower than expected, with the model struggling to correctly classify the gestures. This issue is attributed to the limited variability of the dataset, which motivated the search for another, more diverse dataset.
The original dataset was enriched with the ASL Alphabet Test dataset, which contains images with a wide variety of lighting conditions and backgrounds, and with new samples captured using the low-power camera interfaced with the target device. The result is a dataset called ASL Final, which consists of 90 samples per class, for a total of 540 labeled images.
Using this data, the same architecture as the previous model is trained. To counteract overfitting caused by the small size of the dataset, the dropout rate of the Dropout layers is significantly increased and extensive data augmentation is applied (a sketch of this setup is shown below). Training and quantization were performed directly with the Edge Impulse framework. The following confusion matrix shows the performance of the model after full-integer quantization:
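As a hedged illustration of the augmentation and increased dropout mentioned above, the sketch below builds the regularized variant; the specific augmentation layers, their ranges, and the dropout rate are assumptions (the actual configuration was set in Edge Impulse), and `base` and `NUM_CLASSES` come from the architecture sketch above.

```python
# Hedged sketch of the stronger regularization used for this model.
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomBrightness(0.2),
    tf.keras.layers.RandomContrast(0.2),
])

model_v2 = tf.keras.Sequential([
    data_augmentation,              # active only during training
    base,                           # same MobileNetV1 feature extractor
    tf.keras.layers.Dropout(0.5),   # much higher dropout rate (assumed value)
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```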
To perform the deployment, the model was converted to a C++ file using Edge Impulse; the platform also provides a library that allows the model to be run on the target device. To run a demo, it is sufficient to upload the code to the device, select the correct serial port, and open the serial monitor. The output of the device then shows the model's prediction and the confidence score for each gesture.
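Alternatively, the serial output can be read directly from the host with a short Python script; in the sketch below, pyserial is used, and the port name and baud rate are assumptions that depend on the host system.

```python
# Optional hedged sketch: read the demo's serial output from the host with
# pyserial instead of the Arduino IDE serial monitor.
import serial  # pip install pyserial

with serial.Serial("/dev/ttyACM0", baudrate=115200, timeout=1) as port:  # hypothetical port
    while True:
        line = port.readline().decode(errors="ignore").strip()
        if line:
            print(line)   # per-gesture predictions printed by the device
```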
The project was developed using Python 3.10 and TensorFlow 2.13.0.
Clone the repository and then install all the required packages with:
pip install -r requirements.txt
Licensed under the MIT License.