Considerations
Design / prototyping tool vs. learning tool. Are we helping you get something working or understanding how it works?
If you're not getting good data out of your sensors in the first place, nothing else is going to do any good. This includes:
- correct hookup / functioning sensors / etc.
- lowering noise (and knowing what's "good enough")
- testing the response of sensor data to physical phenomena (calibration)
For example, how do you calibrate a microphone? Can't necessarily ensure silence, nor know the exact volume reaching the microphone. Could try cross comparison with the microphone on the computer, but is the volume the same on both microphones? What about other characteristics (frequency response, etc)?
Might need to use both actual calibration data collected by the user and "magic numbers" from datasheets or elsewhere.
In general, each sample of data (whether live data or training data) needs to be associated with its corresponding sensor calibration metadata (e.g. range and sample rate of the sensor). Calibration functions likely need to use the calibration metadata associated with all relevant data (including live data and all training data). For instance, if some training data was recorded with a 2G accelerometer, some with a 4G, and some with an 8G, the calibration function may want to clip all samples (live and recorded) to +/- 2G. Similarly, if working with audio data collected at different sample rates, the calibration function may want to sub-sample all data to the lowest sample rate.
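A minimal sketch of the accelerometer case above, assuming each recording carries its own range and sample rate. The `Recording` class and `harmonize` function are hypothetical names for illustration, not part of any existing tool, and the decimation is deliberately naive (no anti-alias filtering).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recording:
    """One sample of data plus the sensor settings it was recorded with."""
    values: List[float]   # readings in G
    range_g: float        # sensor range, e.g. 2.0 means +/- 2G
    sample_rate_hz: int   # samples per second


def harmonize(recordings: List[Recording]) -> List[Recording]:
    """Clip every recording to the smallest range and lowest sample rate."""
    min_range = min(r.range_g for r in recordings)
    min_rate = min(r.sample_rate_hz for r in recordings)
    result = []
    for r in recordings:
        # Assumption: each rate is an integer multiple of the lowest rate,
        # so keeping every step-th reading is enough for this sketch.
        step = r.sample_rate_hz // min_rate
        clipped = [max(-min_range, min(min_range, v)) for v in r.values[::step]]
        result.append(Recording(clipped, min_range, min_rate))
    return result


# Training data from an 8G sensor at 100 Hz, live data from a 2G sensor at 50 Hz.
training = Recording([0.5, 3.1, -7.9, 1.0], range_g=8.0, sample_rate_hz=100)
live = Recording([0.1, -1.9], range_g=2.0, sample_rate_hz=50)
harmonized = harmonize([training, live])
```

Keeping the metadata on the recording itself means the same function can run over live and stored data without knowing where either came from.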
Give the person who creates the tool a markup language for scripting the instructions shown to the user while collecting calibration data, plus the ability to collect and save the resulting data into specific places.
Sharing physical test samples, not just test data. Physical calibration? What references can we use? Include instructions for use.
Example author can share calibration data as a baseline for others, even if it's not quite right (e.g. resting humidity, which might be somewhat different from place to place, but at least someone else's values are a baseline).
Can we distinguish different data from bad data? (e.g. different values from just noise). Estimating signal to noise ratio based on some specific training examples. When is training even worth doing? Detecting clipping (e.g. maximum being held too long or happening too often)? Can example author write detectors for bad data as well as calibrators (e.g. saturating an FSR, noise on a microphone)?
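As a rough illustration of what author-written "bad data" detectors might look like, here is a sketch of a clipping check and a crude signal-to-noise estimate. The function names, the 10-bit limits, and the idea of comparing one "signal" example against one "noise only" example are assumptions made for this sketch.

```python
import math
import statistics
from typing import List


def clipping_fraction(values: List[float], lo: float, hi: float) -> float:
    """Fraction of readings pinned at either end of the sensor's range."""
    if not values:
        return 0.0
    return sum(1 for v in values if v <= lo or v >= hi) / len(values)


def longest_clipped_run(values: List[float], lo: float, hi: float) -> int:
    """Length of the longest run of consecutive pinned readings."""
    longest = current = 0
    for v in values:
        current = current + 1 if (v <= lo or v >= hi) else 0
        longest = max(longest, current)
    return longest


def snr_db(signal: List[float], noise: List[float]) -> float:
    """SNR in dB from a 'signal' example and a 'noise only' example.

    Assumes the noise example is not perfectly constant (non-zero power).
    """
    def power(xs):
        mean = statistics.fmean(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)
    return 10 * math.log10(power(signal) / power(noise))


# Hypothetical 10-bit readings (0..1023) from a saturating sensor.
readings = [512, 1023, 1023, 1023, 600, 0, 700, 1023]
fraction = clipping_fraction(readings, lo=0, hi=1023)
run = longest_clipped_run(readings, lo=0, hi=1023)
snr = snr_db(signal=[-10, 10, -10, 10], noise=[-1, 1, -1, 1])
```

Thresholds on `fraction`, `run`, or `snr` (how much clipping is "too often", how low an SNR makes training pointless) would still need to come from the example author or from experiment.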
Example: accelerometer calibration. Using data from the accelerometer lying still on a flat surface, we could determine the 0g and 1g values (i.e. the range of data) along with an estimate of the noise. What about different sample rates (e.g. based on different baud rates between Arduino and computer)? Need temporal normalization.
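A sketch of the at-rest calibration described above, assuming raw (x, y, z) readings with z pointing up and assuming all three axes share the same offset and scale. The function names and the sample values are invented for illustration.

```python
import statistics
from typing import Dict, List, Tuple


def calibrate_at_rest(samples: List[Tuple[float, float, float]]) -> Dict[str, float]:
    """Estimate 0g, 1g and noise from raw readings taken lying flat.

    Assumption: x and y see 0g and z sees 1g, and all axes share one
    offset and scale. Real sensors may need per-axis calibration.
    """
    xs, ys, zs = zip(*samples)
    zero_g = (statistics.fmean(xs) + statistics.fmean(ys)) / 2
    one_g = statistics.fmean(zs)
    # Use the noisiest axis as a conservative noise estimate (raw units).
    noise = max(statistics.pstdev(axis) for axis in (xs, ys, zs))
    return {
        "zero_g": zero_g,
        "one_g": one_g,
        "counts_per_g": one_g - zero_g,
        "noise": noise,
    }


def to_g(raw: float, cal: Dict[str, float]) -> float:
    """Convert one raw reading to G using the calibration values."""
    return (raw - cal["zero_g"]) / cal["counts_per_g"]


resting = [(510, 512, 614), (514, 512, 616), (510, 512, 614), (514, 512, 616)]
cal = calibrate_at_rest(resting)
```

This does not address the sample-rate question; it only covers the amplitude side of calibration.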
Example: accelerometer quality. Wave the accelerometer around and check for clipping. In my ad-hoc testing, it seems like you need at least +/- 8G range to avoid clipping when waving the accelerometer around in your hand.
Example: color sensor. Seems complicated to translate between different color sensors w/ different color profiles (or different light sources). So we might just need to make the user collect their own training samples.
Example: audio. To what extent will you be able to re-use someone else's training data? Might need to tune the pipeline to get classification to work with different setups. Can we normalize temporally (down-sampling or up-sampling) rather than modifying the pipeline (e.g. FFT parameters)?
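One way temporal normalization could look is plain linear interpolation onto a new sample rate, sketched below. The `resample` function is a hypothetical helper; it does no anti-alias filtering, so it is only a starting point for down-sampling audio.

```python
from typing import List


def resample(values: List[float], src_rate_hz: float, dst_rate_hz: float) -> List[float]:
    """Resample by linear interpolation (works for up- or down-sampling)."""
    if len(values) < 2:
        return list(values)
    duration = (len(values) - 1) / src_rate_hz
    # Small epsilon guards against float rounding just below a whole number.
    n_out = int(duration * dst_rate_hz + 1e-9) + 1
    out = []
    for i in range(n_out):
        pos = i * src_rate_hz / dst_rate_hz      # position in source samples
        left = min(int(pos), len(values) - 2)    # index of the left neighbour
        frac = pos - left
        out.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return out


up = resample([0.0, 2.0, 4.0], src_rate_hz=2, dst_rate_hz=4)
down = resample([0.0, 1.0, 2.0, 3.0, 4.0], src_rate_hz=4, dst_rate_hz=2)
```

Normalizing the data this way would let the rest of the pipeline (e.g. FFT parameters) stay fixed across setups, at the cost of whatever the resampling throws away or smooths over.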
This includes acquiring, labeling, and editing training samples. It could also include drawing on existing corpuses of data, or allowing users to share data with each other.
Functions to automatically split into testing and training.
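A sketch of one such function, splitting per class so that every class with more than one sample shows up in both sets. The name `split_by_class`, the `(label, data)` tuple layout, and the default fraction are assumptions for illustration.

```python
import random
from collections import defaultdict


def split_by_class(samples, test_fraction=0.25, seed=0):
    """Split (label, data) pairs into (train, test), class by class."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_label = defaultdict(list)
    for label, data in samples:
        by_label[label].append((label, data))
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Hold out at least one sample per class, unless the class has
        # only a single sample, which then stays in the training set.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


samples = [("wave", i) for i in range(8)] + [("shake", i) for i in range(8, 12)]
train, test = split_by_class(samples)
```

For temporal data the split would probably need to happen on whole recordings rather than individual frames, which this sketch does if each `data` item is one recording.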
How do we encourage people to collect training data that covers a wide range of use cases? Part of the answer may be making it easy to add new training data when you find that a class isn't being recognized.
Providing different takes on training data, e.g. this data covers this lighting condition, or this data seems to work best with your test data.
Should we also provide a way to view the features derived from each training sample? (And pre-processing, etc.)
Handling temporal data: need to edit individual samples, show multiple samples.
Instantaneous data: do we need a way to edit or remove individual data points within a sample?
Don't need to recollect training data when you modify the pipeline.
Matching new training data to existing training data? Can we auto-suggest a class that your new sample might belong to?
Includes pre-processing, feature selection, classifier selection & tuning, etc.
Can example author help you find the right parameters, not just the space to explore?
What, if anything, needs to be specified in code? How much can we include in the GUI? Can we make the GUI able to accept small bits of code?
Which knobs can the system tune automatically to improve accuracy? Which does the user need to decide? Maybe the example author provides a default (generally good value) but the user can tune it?
How do we help people know what classifiers they might use? Documentation that explains the trade-offs?
Dynamic explanation that explains the parameter, e.g. be explicit about the difference between the last value of the parameter and the new one. e.g. "with this value, we detected these new things but not this other thing that we detected before."
Related work / references: See Juxtapose paper (Auto generate sliders for every global variable). Also see Michael Terry dissertation "Set-based user interaction", appendix B (partial: delay assignment of the value of a variable until runtime).
- speed vs. accuracy (post-processing modules)
- false positive vs. false negative (null rejection)
Marking up example so that we know which parameters people should be able to tweak and what they do.
What about cases where there's no machine learning algorithm? Just signal processing. There might still be some interesting things to do.
Can we stick with using algorithms whose "knobs" are more easily described to an end-user?
How do I know if it's working? How do I know what's not working?
Expected accuracy shared with examples. So you know what's feasible.
Test data shared by example author.
How do I evaluate false positives vs. false negatives? The user probably wants to specify false positives / false negatives and have the thresholds (or whatever) determined automatically. Or maybe we just make it easy to see how the false positives and false negatives change as you tune the thresholds.
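Both options can be sketched with the same small table: count false positives and false negatives at each candidate threshold, then either show the table or pick a threshold from a user-supplied budget. The function names and the scoring convention (higher score means "more likely the gesture") are assumptions for this sketch.

```python
def sweep_thresholds(scores, is_positive, thresholds):
    """Return (threshold, false_positives, false_negatives) per threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    rows = []
    for t in thresholds:
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= t and not p)
        fn = sum(1 for s, p in zip(scores, is_positive) if s < t and p)
        rows.append((t, fp, fn))
    return rows


def pick_threshold(rows, max_false_positives):
    """Threshold with the fewest false negatives within the FP budget."""
    ok = [r for r in rows if r[1] <= max_false_positives]
    return min(ok, key=lambda r: (r[2], r[1]))[0] if ok else None


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_positive = [True, True, False, True, False, False]
rows = sweep_thresholds(scores, is_positive, thresholds=[0.2, 0.5, 0.75])
```

Here `rows` is the "make it easy to see" view, and `pick_threshold(rows, max_false_positives=1)` is the "let the user specify" view.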
Do you want to throw away an outlier because it's wrong or keep it because it's an important different version of the thing you're trying to recognize? How do people know?
What is the analysis process when training is slow? You might want to save multiple trained versions of the pipeline so you can compare them without retraining.
Might need to allow the user to specify parts of the test data to ignore (e.g. transition between two states where I don't really care what the system predicts). More generally, how does the user specify expected classifications? This is likely to be different for static vs. temporal classifiers.
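For frame-by-frame test data, one simple convention (an assumption here, not a decided design) is to mark "don't care" frames with `None` in the expected labels and skip them when scoring:

```python
def accuracy_ignoring(expected, predicted):
    """Frame-level accuracy, skipping frames whose expected label is None."""
    pairs = [(e, p) for e, p in zip(expected, predicted) if e is not None]
    if not pairs:
        return None  # nothing to score
    return sum(1 for e, p in pairs if e == p) / len(pairs)


# The two None frames are a transition between "rest" and "wave".
expected = ["rest", "rest", None, None, "wave", "wave"]
predicted = ["rest", "rest", "rest", "wave", "wave", "rest"]
score = accuracy_ignoring(expected, predicted)
```

A static classifier would not need the mask at all, which is one concrete way the two cases differ.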
How do we visualize features with different scales / units? Do we need a way to specify which units go with which features?
Need to show which classes are easily confused, and which samples within those classes cause problems. One complication here is that the data may be a time-series, so we need to summarize predictions over time. Ross et al. discuss one approach to summarizing confusion matrices in "Performance measures for summarizing confusion matrices: the AFRL COMPASE approach", but they aggregate over different conditions, not time.
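A minimal sketch of the "which classes are confused" part, treating each prediction (for a time-series, each frame or window) as one count. This is plain counting under that assumption, not the approach from the cited paper, and the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count how often each (expected, predicted) pair occurs."""
    return Counter(zip(expected, predicted))


def most_confused(counts, top=3):
    """Misclassified (expected, predicted) pairs, most frequent first."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda kv: -kv[1])[:top]


expected = ["wave"] * 4 + ["shake"] * 4 + ["tap"] * 2
predicted = ["wave", "wave", "shake", "shake",
             "shake", "shake", "shake", "wave",
             "tap", "tap"]
counts = confusion_counts(expected, predicted)
worst = most_confused(counts)
```

Keeping the index of each sample alongside its pair would be the next step toward showing which samples within a class cause the problems.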
Potential responses:
- increase null rejection thresholds (variability)
- add missed sample to the training data
- add other training data samples
Potential responses:
- lowering null rejection thresholds
- deleting or trimming training samples that generated / contributed to the false positive
Useful information:
- distance from the live data to the predicted class
- some measure of the extent to which / in what way each sample in that class contributed
Would be good to have some understanding of the extent to which new training samples are doing anything.
Evaluation method may change based on the kind of classifier, e.g. may need to use a time-based algorithm (that allows for small variations in start and end points of classification) along the lines of André Gensler, Bernhard Sick, "Novel Criteria to Measure Performance of Time Series Segmentation Techniques".
Tür, Gökhan, Mazin G. Rahim, and Dilek Z. Hakkani-Tür. "Active labeling for spoken language understanding." INTERSPEECH. 2003 suggest two approaches:
- scoring training data using a classifier trained on previously-labeled (and verified) data
- scoring training data using a classifier trained on the training data itself

Then, have the user re-label data which is given a different label by the classifier than was initially assigned by the user, or those which are given the same label but a low confidence by the classifier.
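The re-labeling step could be sketched as follows. The `classify` callable stands in for whichever of the two classifiers is used, the `(sample_id, user_label, data)` layout and the 0.6 confidence cut-off are assumptions, and the toy classifier exists only to make the example runnable.

```python
def relabel_candidates(samples, classify, min_confidence=0.6):
    """Flag samples the user should look at again.

    samples:  list of (sample_id, user_label, data)
    classify: hypothetical callable, data -> (label, confidence in 0..1)
    """
    flagged = []
    for sample_id, user_label, data in samples:
        label, confidence = classify(data)
        if label != user_label:
            flagged.append((sample_id, "disagrees"))
        elif confidence < min_confidence:
            flagged.append((sample_id, "low confidence"))
    return flagged


def toy_classify(x):
    """Stand-in classifier: thresholds a single number at 0.5."""
    return ("high" if x >= 0.5 else "low", abs(x - 0.5) * 2)


samples = [(1, "high", 0.9), (2, "low", 0.8), (3, "low", 0.45), (4, "low", 0.1)]
flagged = relabel_candidates(samples, toy_classify)
```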
Our problem is different from traditional active labeling / active learning because our user is supplying additional / new training data, not simply re-labeling existing data. That makes the problem different (harder?) because it's not a matter of simply correcting labels but collecting data that covers the feature space in a good way.
How does this work? Do you integrate the machine learning pipeline into an Arduino program that's compiled and loaded onto your board? Is the logic specified on the Arduino, with prediction happening on the computer, with input streaming from and predictions streaming to the Arduino?
Or do you write all the logic on the computer, with just sensor inputs and actuator outputs going from/to an Arduino? Or just write everything on a Raspberry Pi? In either case, in what programming language / framework do you write your system code?
Injecting key presses: games may get too hard if you play with bodily motions (w/ recognizer that's not perfect). Also key repeat may matter in ways that are difficult to simulate. May want to look into Processing games that we can tweak to make appropriately challenging / playable.