Considerations
Design / prototyping tool vs. learning tool. Are we helping you get something working or understand how it works?
If you're not getting good data out of your sensors in the first place, nothing else is going to do any good. This includes:
- correct hookup / functioning sensors / etc.
- lowering noise (and knowing what's "good enough")
- testing the response of sensor data to physical phenomena (calibration)
For example, how do you calibrate a microphone? You can't necessarily ensure silence, nor know the exact volume reaching the microphone. You could try cross-comparison with the microphone on the computer, but is the volume the same on both microphones? What about other characteristics (frequency response, etc.)?
Might need to use both actual calibration data collected by the user and "magic numbers" from datasheets or elsewhere.
In general, each sample of data (whether live data or training data) needs to be associated with its corresponding sensor calibration metadata (e.g. range and sample rate of the sensor). Calibration functions likely need to use the calibration metadata associated with all relevant data (including live data and all training data). For instance, if some training data was recorded with a 2G accelerometer, some with a 4G, and some with an 8G, the calibration function may want to clip all samples (live and recorded) to +/- 2G. Similarly, if working with audio data collected at different sample rates, the calibration function may want to sub-sample all data to the lowest sample rate.
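A minimal sketch of this normalization, assuming each sample carries a dict of calibration metadata (the field names `range_g` and `rate_hz` are illustrative, not from any real API):

```python
def normalize_samples(samples):
    """samples: list of dicts with 'data' (list of floats), 'range_g'
    (accelerometer range in g), and 'rate_hz' (sample rate in Hz).
    Clips all data to the narrowest range in the set and sub-samples
    everything down to the lowest sample rate."""
    min_range = min(s["range_g"] for s in samples)
    min_rate = min(s["rate_hz"] for s in samples)
    out = []
    for s in samples:
        # Clip every sample to the narrowest range seen across all data.
        clipped = [max(-min_range, min(min_range, x)) for x in s["data"]]
        # Crude sub-sampling: keep every k-th point to match the lowest rate.
        k = s["rate_hz"] // min_rate
        out.append(clipped[::k])
    return out
```

The same treatment would need to apply uniformly to live data and all recorded training data, using the metadata attached to each.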
Giving the person who creates the example a markup language for providing instructions (a script) to the user for collecting calibration data. Ability to collect and save the resulting data into specific places.
Sharing physical test samples, not just test data. Physical calibration? What references can we use? Include instructions for use.
The example author can share calibration data as a baseline for others, even if it's not quite right (e.g. resting humidity, which might differ somewhat from place to place, but at least someone else's values provide a baseline).
Can we distinguish genuinely different data from bad data (e.g. different values vs. just noise)? Estimating signal-to-noise ratio based on some specific training examples. When is training even worth doing? Detecting clipping (e.g. the maximum being held too long or happening too often)? Can the example author write detectors for bad data as well as calibrators (e.g. saturating an FSR, noise on a microphone)?
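A bad-data detector of the kind described (the maximum held too long, or hit too often) might be sketched like this; the thresholds are arbitrary placeholders an example author would tune:

```python
def detect_clipping(data, rail, max_fraction=0.02, max_run=3):
    """Flag a sample as clipped if too many points sit at the rail,
    or the rail value is held for too many consecutive points."""
    at_rail = [abs(x) >= rail for x in data]
    fraction = sum(at_rail) / len(data)
    # Find the longest consecutive run of at-rail points.
    run = longest = 0
    for hit in at_rail:
        run = run + 1 if hit else 0
        longest = max(longest, run)
    return fraction > max_fraction or longest > max_run
```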
Example: accelerometer calibration. Using data from the accelerometer lying still on a flat surface, we could determine the 0g and 1g values (i.e. the range of data) along with an estimate of the noise. What about different sample rates (e.g. based on different baud rates between Arduino and computer)? Need temporal normalization.
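The flat-surface calibration described above might look like this sketch, assuming one axis aligned with gravity (reading 1g) and one level axis (reading 0g); the function names are made up for illustration:

```python
import statistics

def calibrate_still(z_readings, xy_readings):
    """z_readings: raw values from the axis aligned with gravity (1g);
    xy_readings: raw values from a level axis (0g).
    Returns the raw 0g offset, the scale in raw units per g, and a
    noise estimate (standard deviation of the still readings)."""
    zero_g = statistics.mean(xy_readings)
    one_g = statistics.mean(z_readings)
    scale = one_g - zero_g
    noise = statistics.stdev(xy_readings)
    return zero_g, scale, noise
```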
Example: accelerometer quality. Wave the accelerometer around and check for clipping. In my ad-hoc testing, it seems like you need at least +/- 8G range to avoid clipping when waving the accelerometer around in your hand.
Example: color sensor. Seems complicated to translate between different color sensors w/ different color profiles (or different light sources). So we might just need to make the user collect their own training samples.
Example: audio. To what extent will you be able to re-use someone else's training data? Might need to tune the pipeline to get classification to work with different setups. Can we normalize temporally (down-sampling or up-sampling) rather than modifying the pipeline (e.g. FFT parameters)?
See Jeffery et al., "A Pipelined Framework for Online Cleaning of Sensor Data Streams", for a system (also called ESP!) that provides methods for online cleaning of data, e.g. based on individual data points, aggregation / smoothing over time windows, or redundant sensors.
This includes acquiring, labeling, and editing training samples. It could also include drawing on existing corpora of data, or allowing users to share data with each other.
Functions to automatically split into testing and training.
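Such a split function could be as simple as this sketch (fixed seed so the split is reproducible):

```python
import random

def split(samples, test_fraction=0.3, seed=0):
    """Shuffle labeled samples and split them into (training, test) sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A real version would likely need to stratify by class and, for temporal data, split by recording session rather than by individual sample.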
How do we encourage people to collect training data that covers a wide range of use cases? Part of the answer may be making it easy to add new training data when you find that a class isn't being recognized.
Providing different takes on training data, e.g. this data covers this lighting condition, or this data seems to work best with your test data.
Should we also provide a way to view the features derived from each training sample? (And pre-processing, etc.)
Handling temporal data: need to edit individual samples, show multiple samples.
Instantaneous data: do we need a way to edit or remove individual data points within a sample?
Users shouldn't need to re-collect training data when they modify the pipeline.
Matching new training data to existing training data? Can we auto-suggest a class that your new sample might belong to?
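One cheap way to auto-suggest a class is nearest centroid over the existing training data; a sketch, with no claim that this is the right similarity measure:

```python
def suggest_class(new_sample, labeled):
    """labeled: dict mapping class name -> list of feature vectors.
    Suggests the class whose centroid is nearest to the new sample."""
    def centroid(vectors):
        return [sum(xs) / len(vectors) for xs in zip(*vectors)]
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(labeled, key=lambda c: dist(new_sample, centroid(labeled[c])))
```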
Includes pre-processing, feature selection, classifier selection & tuning, etc.
Can example author help you find the right parameters, not just the space to explore?
What, if anything, needs to be specified in code? How much can we include in the GUI? Can we make the GUI able to accept small bits of code?
Which knobs can the system tune automatically to improve accuracy? Which does the user need to decide? Maybe the example author provides a default (a generally good value) but the user can tune it?
How do we help people know what classifiers they might use? Documentation that explains the trade-offs?
Dynamic explanation of each parameter, e.g. being explicit about the difference between the last value of the parameter and the new one: "with this value, we detected these new things but not this other thing that we detected before."
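A toy version of such an explanation, diffing the detections produced under the old and new parameter values (the phrasing is a placeholder):

```python
def explain_change(old_detections, new_detections):
    """Compare the events detected under the old vs. new parameter value
    and phrase the difference for the user."""
    old, new = set(old_detections), set(new_detections)
    gained, lost = sorted(new - old), sorted(old - new)
    return ("with this value we now detect {} "
            "but no longer detect {}".format(gained, lost))
```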
Related work / references: See Juxtapose paper (Auto generate sliders for every global variable). Also see Michael Terry dissertation "Set-based user interaction", appendix B (partial: delay assignment of the value of a variable until runtime).
- speed vs. accuracy (post-processing modules)
- false positive vs. false negative (null rejection)
Marking up example so that we know which parameters people should be able to tweak and what they do.
What about cases where there's no machine learning algorithm? Just signal processing. There might still be some interesting things to do.
Can we stick with using algorithms whose "knobs" are more easily described to an end-user?
How do I know if it's working? How do I know what's not working?
Expected accuracy shared with examples. So you know what's feasible.
Test data shared by example author.
How do I evaluate false positives vs. false negatives? The user probably wants to specify false positives / false negatives and have the thresholds (or whatever) determined automatically. Or maybe we just make it easy to see how the false positives and false negatives change as you tune the thresholds.
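The "see how false positives and false negatives change as you tune the thresholds" idea might be sketched as a simple threshold sweep over test data (labels mark whether each sample is a true instance of the class):

```python
def sweep_threshold(scores, labels, thresholds):
    """scores: classifier confidence per test sample; labels: True if the
    sample really belongs to the class. Returns (threshold, fp, fn) rows
    so the user can see how the trade-off shifts as the threshold moves."""
    rows = []
    for t in thresholds:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        rows.append((t, fp, fn))
    return rows
```

Letting the user specify a target false-positive or false-negative count and searching these rows for the nearest threshold would be the inverse workflow described above.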
Do you want to throw away an outlier because it's wrong or keep it because it's an important different version of the thing you're trying to recognize? How do people know?
What is the analysis process when training is slow? You might want to save multiple trained versions of the pipeline so you can compare them without retraining.
Might need to allow the user to specify parts of the test data to ignore (e.g. transition between two states where I don't really care what the system predicts). More generally, how does the user specify expected classifications? This is likely to be different for static vs. temporal classifiers.
How do we visualize features with different scales / units? Do we need a way to specify which units go with which features?
Particularly high information gain may be a sign that the training sample is bad -- although it could also mean that it's just very useful new information.
We might look at the information gain of each recently-collected sample. In some cases, though, even if the new sample has a low information gain (i.e. it is definitely not being confused with other classes), it still may be significantly changing the classifier boundary w.r.t. the overall feature space, which is important for null rejection. In that case, we might want to calculate some difference in the shape of the classifier's distribution with and without the sample.
Potential responses (to a false negative / missed detection):
- increase null rejection thresholds (variability)
- add the missed sample to the training data
- add other training data samples
Potential responses (to a false positive):
- lower null rejection thresholds
- delete or trim training samples that generated / contributed to the false positive
Useful information:
- distance from the live data to the predicted class
- some measure of the extent to which / in what way each sample in that class contributed
One way to think about this is the extent to which each sample intrudes on / is similar to samples of another class. A general way to calculate this is by doing leave-one-out training of the model and then calculating the class likelihoods for the left-out sample.
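A sketch of the leave-one-out idea, using a nearest-centroid stand-in for the real classifier (the actual pipeline would use the trained model's class likelihoods instead):

```python
def loo_flags(dataset):
    """dataset: list of (feature_vector, label). Leave each sample out,
    fit class centroids on the rest, and flag samples that land nearer
    another class's centroid -- candidates for user review (useful
    outlier, or garbage?)."""
    def centroid(vs):
        return [sum(xs) / len(vs) for xs in zip(*vs)]
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    flags = []
    for i, (x, y) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]
        by_class = {}
        for v, lab in rest:
            by_class.setdefault(lab, []).append(v)
        cents = {lab: centroid(vs) for lab, vs in by_class.items()}
        nearest = min(cents, key=lambda lab: dist(x, cents[lab]))
        if nearest != y:
            flags.append(i)
    return flags
```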
We may want to specialize this scoring for different classifiers. For instance, in an SVM, we may be able to calculate information gain by using the weights assigned to each sample (see Guyon, Matic, and Vapnik, "Discovering Informative Patterns and Data Cleaning"). For DTW, it might be a comparison of the average distance of each sample to other samples in its class vs. its distance to the samples in other classes.
High information gain / confusion with other classes can be because a sample is a useful distinguishing example, or because it's low quality or simply mislabeled. There's not necessarily any way for the system to tell the difference between useful distinguishing examples and garbage data, so we want the user to be aware of these high information gain samples. See Guyon, Matic, Vapnik, "Discovering Informative Patterns and Data Cleaning" for a more thorough discussion.
One complication here is that the data may be a time-series, so we need to summarize predictions over time. Ding et al., "Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures", survey various techniques for distance measurements for time series data. Li et al., "Linking Temporal Records", discuss methods of comparing time-series data that take into account the typical continuity of the data. Ross et al. discuss an approach to summarizing confusion matrices in "Performance measures for summarizing confusion matrices: the AFRL COMPASE approach", but they aggregate over different conditions, not time.
We may also want to aggregate this confusion measure over all samples in each class, to show overall confusion between classes.
For some classifiers (e.g. Naive Bayes or DTW), we may be able to calculate the extent to which a new sample changes the behavior of each individual class, independent of its effect on the separability of different classes. For instance, with Naive Bayes, we might measure the change in probability distribution using something like the Bhattacharyya distance, Hellinger distance, or Kullback-Leibler divergence. For DTW, this might be something like the average distance of a sample to every other sample relative to the average distance of all (other) samples to each other. Or, for the GRT approach more specifically, we might look at the change to the "exemplar" sample with and without the inclusion of a given sample in the training set.
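For the Naive Bayes case, a sketch of measuring a sample's impact as the Hellinger distance between 1-D Gaussians fit with and without it (the closed-form Hellinger distance between two normal distributions):

```python
import math
import statistics

def hellinger_gauss(m1, s1, m2, s2):
    """Hellinger distance between N(m1, s1^2) and N(m2, s2^2)."""
    h2 = 1 - math.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * \
        math.exp(-((m1 - m2) ** 2) / (4 * (s1**2 + s2**2)))
    return math.sqrt(h2)

def sample_impact(values, new_value):
    """How much does adding new_value shift the class's Gaussian model?"""
    m1, s1 = statistics.mean(values), statistics.stdev(values)
    extended = values + [new_value]
    m2, s2 = statistics.mean(extended), statistics.stdev(extended)
    return hellinger_gauss(m1, s1, m2, s2)
```

A multi-feature version would aggregate this per-feature distance, and the DTW exemplar comparison would need a different measure entirely.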
A simple approach might be doing leave-one-out training, then taking the distance from the left-out sample to its own class. This won't work for classifiers that don't implement class distances (e.g. SVM). It also may not do a good job of reflecting the extent to which a sample changes the classifier as a sample with a big distance might not affect the classifier much. For instance, in ANBC or GMM, a single far-away data point won't do much to change the overall mean and standard deviation. In DTW, a very different sample still might not change the sample that's chosen as the exemplar for the class, or the new exemplar may still be close to the previous one.
For other classifiers (e.g. SVM), class distributions may be intimately connected with the boundaries between classes (although see outlier detection using a one-class SVM).
Evaluation method may change based on the kind of classifier. e.g. may need to use a time-based algorithm (that allows for small variations in start and end points of classification) along the lines of André Gensler, Bernhard Sick, "Novel Criteria to Measure Performance of Time Series Segmentation Techniques"
Tür, Rahim, and Hakkani-Tür, "Active labeling for spoken language understanding" (INTERSPEECH 2003), suggest two approaches:
- scoring training data using a classifier trained on previously-labeled (and verified) data
- scoring training data using a classifier trained on the training data itself

Then, have the user re-label data which is given a different label by the classifier than was initially assigned by the user, or which is given the same label but with low confidence by the classifier.
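That relabeling heuristic might be sketched as follows, where `predict` stands in for whichever classifier (self-trained or trained on verified data) produces a label and confidence:

```python
def flag_for_relabeling(dataset, predict, confidence_threshold=0.6):
    """dataset: list of (features, user_label). predict(features) returns
    (predicted_label, confidence). Flag indices where the classifier
    disagrees with the user's label, or agrees only with low confidence."""
    flagged = []
    for i, (x, y) in enumerate(dataset):
        pred, conf = predict(x)
        if pred != y or conf < confidence_threshold:
            flagged.append(i)
    return flagged
```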
Our problem is different from traditional active labeling / active learning because our user is supplying additional / new training data, not simply re-labeling existing data. That makes the problem different (harder?) because it's not a matter of simply correcting labels but collecting data that covers the feature space in a good way.
How does this work? Do you integrate the machine learning pipeline into an Arduino program that's compiled and loaded onto your board? Is the logic specified on the Arduino, with prediction happening on the computer, with input streaming from and predictions streaming to the Arduino?
Or do you write all the logic on the computer, with just sensor input and actuator outputs going from/to an Arduino? Or just write everything on a Raspberry Pi? In either case, in what programming language / framework do you write your system code?
Injecting key presses: games may get too hard if you play with bodily motions (with a recognizer that's not perfect). Also, key repeat may matter in ways that are difficult to simulate. May want to look into Processing games that we can tweak to make appropriately challenging / playable.