[Example] Simple Audio Detection
In this example, we show how to use ESP to create a simple audio classifier that recognizes sounds with distinct frequency spectrum characteristics. If you haven't read the tutorial on gesture recognition using accelerometers, you should at least skim it to learn the general workflow.
The source code of this example can be found in the repository [link].
Here is a short video demonstrating piano key press detection.
A MacBook with ESP set up (see the installation guide in the README). We will use the MacBook's built-in microphone for this example. Optionally, you can plug in an external microphone. For example, an electret microphone can be hooked up to an Arduino running this firmware code to collect audio data. You will then need to update the input stream (line 4) to use SerialStream.
The Fast Fourier Transform (FFT) is a signal processing algorithm that converts a signal from its original domain (such as time) to a representation in the frequency domain.
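To make the time-to-frequency conversion concrete, here is a minimal sketch. It is not ESP's implementation: it uses a naive discrete Fourier transform, which produces the same values an FFT computes, only more slowly. The names `magnitudeSpectrum`, `peakBin`, and `makeSine` are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) discrete Fourier transform of a real-valued frame.
// Returns the magnitude of bins 0..N/2 (an FFT yields the same values faster).
std::vector<double> magnitudeSpectrum(const std::vector<double>& frame) {
    assert(!frame.empty());
    const std::size_t n = frame.size();
    const double pi = std::acos(-1.0);
    std::vector<double> mags(n / 2 + 1);
    for (std::size_t k = 0; k < mags.size(); ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += frame[t] * std::polar(1.0, angle);
        }
        mags[k] = std::abs(sum);
    }
    return mags;
}

// Index of the strongest bin, skipping bin 0 (the constant offset).
std::size_t peakBin(const std::vector<double>& mags) {
    assert(mags.size() >= 2);
    std::size_t best = 1;
    for (std::size_t k = 2; k < mags.size(); ++k) {
        if (mags[k] > mags[best]) best = k;
    }
    return best;
}

// Hypothetical test signal: a sine wave with `cycles` full periods in n samples.
std::vector<double> makeSine(std::size_t n, double cycles) {
    const double pi = std::acos(-1.0);
    std::vector<double> out(n);
    for (std::size_t t = 0; t < n; ++t) {
        out[t] = std::sin(2.0 * pi * cycles * static_cast<double>(t) /
                          static_cast<double>(n));
    }
    return out;
}
```

For a frame of N samples recorded at sample rate fs, bin k corresponds to the frequency k * fs / N, so `peakBin(magnitudeSpectrum(frame))` points at the dominant frequency in the frame. Sounds with distinct spectra, such as different piano keys, put their energy in different bins.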
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. The discriminative nature makes it a nice fit for audio classification. For more about SVM, please see the OpenCV SVM tutorial.
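As a rough illustration of what "defined by a separating hyperplane" means, the sketch below evaluates an already-trained two-class linear model. Training is not shown, and the `LinearModel` struct and both functions are hypothetical names for illustration, not part of ESP or any SVM library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical trained linear model: the hyperplane is w . x + b = 0.
struct LinearModel {
    std::vector<double> w;  // one weight per feature (e.g. per FFT bin)
    double b;               // offset of the hyperplane
};

// Signed score of x relative to the hyperplane: positive on one side,
// negative on the other.
double decisionValue(const LinearModel& m, const std::vector<double>& x) {
    assert(m.w.size() == x.size());
    double score = m.b;
    for (std::size_t i = 0; i < x.size(); ++i) {
        score += m.w[i] * x[i];
    }
    return score;
}

// Two-class prediction: +1 or -1 depending on the side of the hyperplane.
int predict(const LinearModel& m, const std::vector<double>& x) {
    return decisionValue(m, x) >= 0.0 ? 1 : -1;
}
```

With FFT magnitudes as the feature vector, each weight says how strongly energy in that frequency bin pushes the prediction toward one class or the other.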
void setup() {
    // Name the single input dimension.
    stream.setLabelsForAllDimensions({"audio"});

    // Two calibration steps: silence (bias) and shouting (range).
    calibrator.addCalibrateProcess("Bias", "Remain silent", backgroundCollected)
        .addCalibrateProcess("Range", "Shout as much as possible", shoutCollected);

    // FFT features feed a linear SVM classifier.
    pipeline.addFeatureExtractionModule(
        FFT(kFFT_WindowSize, kFFT_HopSize,
            DIM, FFT::RECTANGULAR_WINDOW, true, false));
    pipeline.setClassifier(
        SVM(SVM::LINEAR_KERNEL, SVM::C_SVC, true, true));

    // Smooth predictions: 25 of the last 40 labels must agree.
    pipeline.addPostProcessingModule(ClassLabelFilter(25, 40));

    useInputStream(stream);
    useCalibrator(calibrator);
    usePipeline(pipeline);
}
The code above first names the input data as "audio".
Then a calibrator is specified with two calibration processes:
- Keep silent to collect data that represents the background noise.
- Shout or make noises so that the range of this microphone can be measured.
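The two calibration steps above can be sketched as follows. This is an assumption about how bias and range might be used, not ESP's actual calibration code: the bias is taken as the mean of the silent recording, the range as the largest deviation from that bias in the loud recording, and each sample is then centered and scaled. All names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical calibration result for one microphone.
struct Calibration {
    double bias;   // resting level measured during silence
    double range;  // largest deviation from the bias measured while shouting
};

// Bias: mean of the samples collected while the user stays silent.
double estimateBias(const std::vector<double>& silence) {
    assert(!silence.empty());
    double sum = 0.0;
    for (double s : silence) sum += s;
    return sum / static_cast<double>(silence.size());
}

// Range: largest absolute deviation from the bias in the loud samples.
double estimateRange(const std::vector<double>& loud, double bias) {
    assert(!loud.empty());
    double range = 0.0;
    for (double s : loud) range = std::max(range, std::fabs(s - bias));
    return range;
}

// Center a raw sample on the bias and scale it by the range, so the
// loudest calibration sample maps to roughly -1 or +1.
double applyCalibration(double sample, const Calibration& c) {
    assert(c.range > 0.0);
    return (sample - c.bias) / c.range;
}
```

Normalizing this way means the downstream features do not depend on the particular microphone's offset or gain, under the assumption that the shout recording exercises most of the usable range.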
The two calibration steps are reflected in the custom GUI generated:
The pipeline uses the techniques described in the background section: FFT and SVM. You can see an example of the result of FFT below:
A ClassLabelFilter post-processing module smooths the prediction results to suppress false detections. If 25 of the past 40 classifications are the same, we consider it a positive detection. The number 40 comes from a back-of-the-envelope calculation: we sample the audio at roughly 5 kHz and compute the FFT with a hop size of 128, so in 1 second we get 5000 / 128 ≈ 39 classification results. We rounded this up to 40. The threshold of 25 is a tunable parameter: adjusting it controls the false positive rate.
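The smoothing rule and the arithmetic behind the buffer size can be sketched as below. This is a simplified stand-in written for illustration, not the library's ClassLabelFilter: the class name `MajorityLabelFilter`, the function `resultsPerSecond`, and the choice of 0 to mean "no detection" are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Number of classification results produced per second when one result
// is emitted every `hopSize` samples.
double resultsPerSecond(double sampleRateHz, double hopSize) {
    return sampleRateHz / hopSize;
}

// Keeps the most recent `bufferSize` labels and reports a label only if it
// occurs at least `minimumCount` times among them; otherwise reports 0.
class MajorityLabelFilter {
public:
    MajorityLabelFilter(std::size_t minimumCount, std::size_t bufferSize)
        : minimumCount_(minimumCount), bufferSize_(bufferSize) {
        assert(bufferSize > 0 && minimumCount <= bufferSize);
    }

    unsigned filter(unsigned label) {
        buffer_.push_back(label);
        if (buffer_.size() > bufferSize_) buffer_.pop_front();

        // Count how often each label occurs in the current buffer.
        std::map<unsigned, std::size_t> counts;
        for (unsigned l : buffer_) ++counts[l];

        // Pick the most frequent label (lowest label wins ties).
        unsigned best = 0;
        std::size_t bestCount = 0;
        for (const auto& entry : counts) {
            if (entry.second > bestCount) {
                best = entry.first;
                bestCount = entry.second;
            }
        }
        return bestCount >= minimumCount_ ? best : 0;
    }

private:
    std::size_t minimumCount_;
    std::size_t bufferSize_;
    std::deque<unsigned> buffer_;
};
```

With `MajorityLabelFilter filter(25, 40)`, a sound has to be recognized in 25 of the last 40 results, roughly the last second of audio, before it is reported. Raising the first argument makes detection stricter; lowering it makes it more permissive.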
With the custom setup() code in place, we launch the application and follow the procedure outlined in another tutorial (todo): collecting calibration samples, collecting training data, and naming each class. In the training tab, pressing the f key shows the training data and their corresponding feature vectors (the FFT output here). Below is a screenshot from the demo we prepared in the video above:
One possible idea is to use this pipeline to detect customized tangible input devices such as Lamello [1].
[1] Savage, Valkyrie, et al. "Lamello: Passive acoustic sensing for tangible input components." Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015.