Considerations

High Level Questions

Design / prototyping tool vs. learning tool. Are we helping you get something working, or helping you understand how it works?

Getting Good Signals

If you're not getting good data out of your sensors in the first place, nothing else is going to do any good. This includes:

correct hookup / functioning sensors / etc.
lowering noise (and knowing what's "good enough")
testing response of sensor data to physical phenomenon (calibration)

For example, how do you calibrate a microphone? Can't necessarily ensure silence, nor know the exact volume reaching the microphone. Could try cross comparison with the microphone on the computer, but is the volume the same on both microphones? What about other characteristics (frequency response, etc)?

Might need to use both actual calibration data collected by the user and "magic numbers" from datasheets or elsewhere.

In general, each sample of data (whether live data or training data) needs to be associated with its corresponding sensor calibration metadata (e.g. range and sample rate of the sensor). Calibration functions likely need to use the calibration metadata associated with all relevant data (including live data and all training data). For instance, if some training data was recorded with a +/- 2g accelerometer, some with a +/- 4g, and some with a +/- 8g, the calibration function may want to clip all samples (live and recorded) to +/- 2g. Similarly, if working with audio data collected at different sample rates, the calibration function may want to down-sample all data to the lowest sample rate.
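A minimal sketch of such a calibration function, assuming each recording carries its range and sample rate as metadata. The `Recording` type, its field names, and `harmonize` are hypothetical, and the decimation is deliberately naive (no anti-alias filter, integer rate ratios only).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recording:
    values: List[float]    # accelerometer readings, in g
    range_g: float         # hypothetical metadata: full-scale range (+/- g)
    sample_rate_hz: int    # hypothetical metadata: samples per second


def harmonize(recordings: List[Recording]) -> List[Recording]:
    """Clip every recording to the smallest range and decimate to the lowest rate."""
    min_range = min(r.range_g for r in recordings)
    min_rate = min(r.sample_rate_hz for r in recordings)
    out = []
    for r in recordings:
        # Keep every k-th sample; assumes the rate is an integer multiple of min_rate.
        step = r.sample_rate_hz // min_rate
        kept = r.values[::step]
        clipped = [max(-min_range, min(min_range, v)) for v in kept]
        out.append(Recording(clipped, min_range, min_rate))
    return out


recs = [
    Recording([0.5, 3.0, -2.5, 1.0], range_g=4.0, sample_rate_hz=100),
    Recording([1.9, -1.2], range_g=2.0, sample_rate_hz=50),
]
result = harmonize(recs)
print(result[0].values)  # [0.5, -2.0]
```

The same function would have to run on live data too, so that the classifier sees live and recorded samples in the same form.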

Give the person who creates the tool a markup language for writing instructions (a script) that guide the user through collecting calibration data, along with the ability to collect and save the resulting data into specific places.

Sharing physical test samples, not just test data. Physical calibration? What references can we use? Include instructions for use.

Example author can share calibration data as a baseline for others, even if it's not quite right (e.g. resting humidity, which might be somewhat different from place to place, but at least someone else's values are a baseline).

Can we distinguish different data from bad data? (e.g. different values from just noise). Estimating signal to noise ratio based on some specific training examples. When is training even worth doing? Detecting clipping (e.g. maximum being held too long or happening too often)? Can example author write detectors for bad data as well as calibrators (e.g. saturating an FSR, noise on a microphone)?
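As a rough illustration of what an author-written bad-data detector could look like, here is a sketch of a clipping check (readings held at a rail for too long) and a signal-to-noise estimate from two labeled recordings. The function names, the `min_run` parameter, and the 10-bit rail values are assumptions for the example, not part of any existing API.

```python
import math


def clipping_fraction(samples, lo, hi, min_run=3):
    """Fraction of samples that sit in runs of >= min_run readings at either rail."""
    clipped = 0
    run = 0
    for v in samples:
        if v <= lo or v >= hi:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # a run that lasts until the end of the recording
        clipped += run
    return clipped / len(samples)


def snr_db(signal, noise):
    """Signal power over noise power, in dB, from a 'signal' and a 'quiet' recording."""
    p_signal = sum(v * v for v in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


# Hypothetical 10-bit ADC readings: four samples stuck at the top rail.
readings = [512, 1023, 1023, 1023, 1023, 600, 0, 500]
frac = clipping_fraction(readings, lo=0, hi=1023)
print(frac)  # 0.5

ratio = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

A single reading at the rail (the lone `0` above) is not counted, which is one way to separate "maximum held too long" from an occasional legitimate extreme value.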

Example: accelerometer calibration. Using data from the accelerometer lying still on a flat surface, we could determine the 0g and 1g values (i.e. the range of data) along with an estimate of the noise. What about different sample rates (e.g. based on different baud rates between Arduino and computer)? Need temporal normalization.
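A sketch of the at-rest calibration, under strong assumptions: the board lies flat with z pointing up (so z reads 1g and x, y read 0g), and all three axes share one offset and scale. The function names and the raw ADC counts are made up for illustration.

```python
import statistics


def calibrate_at_rest(x_raw, y_raw, z_raw):
    """Estimate the 0g and 1g raw levels and the noise from a motionless recording."""
    zero_g = statistics.fmean(list(x_raw) + list(y_raw))  # horizontal axes read 0g
    one_g = statistics.fmean(z_raw)                       # vertical axis reads 1g
    counts_per_g = one_g - zero_g
    # Worst-case per-axis standard deviation, converted from counts to g.
    noise_counts = max(statistics.stdev(axis) for axis in (x_raw, y_raw, z_raw))
    return {
        "zero_g": zero_g,
        "one_g": one_g,
        "counts_per_g": counts_per_g,
        "noise_g": noise_counts / counts_per_g,
    }


def to_g(raw, cal):
    """Convert one raw reading to g using the calibration above."""
    return (raw - cal["zero_g"]) / cal["counts_per_g"]


cal = calibrate_at_rest(
    x_raw=[511, 512, 513, 512],
    y_raw=[512, 512, 511, 513],
    z_raw=[614, 615, 613, 614],
)
print(cal["zero_g"], cal["one_g"])  # 512.0 614.0
print(to_g(716, cal))               # 2.0
```

The `noise_g` estimate gives a floor for "good enough": differences between classes that are smaller than this are unlikely to be learnable.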

Example: accelerometer quality. Wave the accelerometer around and check for clipping. In my ad-hoc testing, it seems like you need at least a +/- 8g range to avoid clipping when waving the accelerometer around in your hand.

Example: color sensor. Seems complicated to translate between different color sensors w/ different color profiles (or different light sources). So we might just need to make the user collect their own training samples.

Example: audio. To what extent will you be able to re-use someone else's training data? Might need to tune the pipeline to get classification to work with different setups. Can we normalize temporally (down-sampling or up-sampling) rather than modifying the pipeline (e.g. FFT parameters)?
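One possible form of temporal normalization is to resample every recording to a common rate before it enters the pipeline, so that the FFT parameters can stay fixed. The sketch below uses plain linear interpolation; `resample` is a hypothetical helper, and real audio would also need a low-pass filter before down-sampling to avoid aliasing.

```python
def resample(values, src_rate_hz, dst_rate_hz):
    """Linearly interpolate a uniformly sampled signal onto a new sample rate."""
    if len(values) < 2:
        return list(values)
    duration = (len(values) - 1) / src_rate_hz
    n_out = int(round(duration * dst_rate_hz)) + 1
    out = []
    for i in range(n_out):
        pos = (i / dst_rate_hz) * src_rate_hz    # fractional index into values
        lo = min(int(pos), len(values) - 2)      # left neighbour, kept in bounds
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


down = resample([0, 2, 4, 6, 8], src_rate_hz=4, dst_rate_hz=2)
up = resample([0, 2], src_rate_hz=1, dst_rate_hz=2)
print(down)  # [0.0, 4.0, 8.0]
print(up)    # [0.0, 1.0, 2.0]
```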

See Jeffery et al., "A Pipelined Framework for Online Cleaning of Sensor Data Streams" for a system (also called ESP!) that provides methods for online cleaning of data, e.g. based on individual data points, aggregation / smoothing over time windows or redundant sensors, etc.

Managing Training Data

This includes acquiring, labeling, and editing training samples. It could also include drawing on existing corpuses of data, or allowing users to share data with each other.

Functions to automatically split into testing and training.
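A sketch of what an automatic split could look like, assuming training samples are stored as `(features, label)` pairs. It splits per class so that no class ends up only in the test set. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Split (features, label) pairs class by class into train and test lists."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out at least one sample per class, unless the class has only one.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


data = [([i], "wave") for i in range(8)] + [([i], "tap") for i in range(4)]
train, test = split_train_test(data)
print(len(train), len(test))  # 9 3
```

A repeatable split matters here because the user will compare pipeline variants against the same held-out samples.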

How do we encourage people to collect training data that covers a wide range of use cases? Part of the answer may be making it easy to add new training data when you find that a class isn't being recognized.

Providing different takes on training data, e.g. this data covers this lighting condition, or this data seems to work best with your test data.

Should we also provide a way to view the features derived from each training sample? (And pre-processing, etc.)

Handling temporal data: need to edit individual samples, show multiple samples.

Instantaneous data: do we need a way to edit or remove individual data points within a sample?

Don't need to recollect training data when you modify the pipeline.

Matching new training data to existing training data? Can we auto-suggest a class that your new sample might belong to?

Configuring Pipeline

Includes pre-processing, feature selection, classifier selection & tuning, etc.

Can example author help you find the right parameters, not just the space to explore?

What, if anything, needs to be specified in code? How much can we include in the GUI? Can we make the GUI able to accept small bits of code?

Which knobs can the system tune automatically to improve accuracy? Which does the user need to decide? Maybe the example author provides a default (generally good value) but the user can tune it?

How do we help people know what classifiers they might use? Documentation that explains the trade-offs?

Dynamic explanations of a parameter: be explicit about the difference between the previous value of the parameter and the new one, e.g. "with this value, we detected these new things but not this other thing that we detected before."

Related work / references: See Juxtapose paper (Auto generate sliders for every global variable). Also see Michael Terry dissertation "Set-based user interaction", appendix B (partial: delay assignment of the value of a variable until runtime).

speed vs. accuracy (post-processing modules)
false positive vs. false negative (null rejection)

Marking up example so that we know which parameters people should be able to tweak and what they do.

What about cases where there's no machine learning algorithm? Just signal processing. There might still be some interesting things to do.

Can we stick with using algorithms whose "knobs" are more easily described to an end-user?

Evaluation / Testing

How do I know if it's working? How do I know what's not working?

Expected accuracy shared with examples. So you know what's feasible.

Test data shared by example author.

How do I evaluate false positives vs. false negatives? The user probably wants to specify false positives / false negatives and have the thresholds (or whatever) determined automatically. Or maybe we just make it easy to see how the false positives and false negatives change as you tune the thresholds.
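Both options can be sketched with the same small table: sweep candidate thresholds over scored test data, count false positives and false negatives at each, then either show the table or pick a threshold from a user-specified budget. The function names and the scores are made up, and "predict positive when score >= threshold" is an assumption about how the classifier reports confidence.

```python
def sweep_thresholds(scores, is_positive, thresholds):
    """Return (threshold, false_positives, false_negatives) for each threshold."""
    rows = []
    for t in thresholds:
        fp = sum(1 for s, pos in zip(scores, is_positive) if s >= t and not pos)
        fn = sum(1 for s, pos in zip(scores, is_positive) if s < t and pos)
        rows.append((t, fp, fn))
    return rows


def pick_threshold(rows, max_false_positives):
    """Threshold with the fewest false negatives within the false-positive budget."""
    ok = [r for r in rows if r[1] <= max_false_positives]
    return min(ok, key=lambda r: (r[2], r[1]))[0] if ok else None


scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
is_positive = [False, False, True, True, True, False]
rows = sweep_thresholds(scores, is_positive, [0.3, 0.5, 0.75])
print(rows)                     # [(0.3, 1, 0), (0.5, 0, 1), (0.75, 0, 2)]
print(pick_threshold(rows, 0))  # 0.5
```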

Do you want to throw away an outlier because it's wrong or keep it because it's an important different version of the thing you're trying to recognize? How do people know?

What is the analysis process when training is slow? You might want to save multiple trained versions of the pipeline so you can compare them without retraining.

Might need to allow the user to specify parts of the test data to ignore (e.g. transition between two states where I don't really care what the system predicts). More generally, how does the user specify expected classifications? This is likely to be different for static vs. temporal classifiers.
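For a frame-by-frame (static) classifier, one simple way to express "don't care" regions is a reserved marker in the expected labels. The sketch below assumes `None` plays that role; the function name is hypothetical.

```python
def accuracy_with_ignored(predicted, expected, ignore_label=None):
    """Accuracy over the frames whose expected label is not the ignore marker."""
    scored = [(p, e) for p, e in zip(predicted, expected) if e != ignore_label]
    if not scored:
        return None  # nothing to score
    return sum(1 for p, e in scored if p == e) / len(scored)


# The two None frames mark a transition the user does not care about.
expected = ["rest", "rest", None, None, "wave", "wave"]
predicted = ["rest", "rest", "rest", "wave", "wave", "rest"]
acc = accuracy_with_ignored(predicted, expected)
print(acc)  # 0.75
```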

How do we visualize features with different scales / units? Do we need a way to specify which units go with which features?

Things the User Might Want to Know

Is any particular training sample any good?

Particularly high information gain may be a sign that the training sample is bad -- although it could also mean that it's just very useful new information.

Do I need to keep collecting data? How much impact has recently-collected data had?

We might look at the information gain of each recently-collected sample. In some cases, though, even if the new sample has a low information gain (i.e. it is definitely not being confused with other classes), it still may be significantly changing the classifier boundary w.r.t. the overall feature space, which is important for null rejection. In that case, we might want to calculate some difference in the shape of the classifier's distribution with and without the sample.

Are these two classes feasible to distinguish?

How can I improve the differentiation between these two classes?

Scores We Can Calculate

Impact of new data.

Would be good to have some understanding of the extent to which new training samples are doing anything. Samples that do little might mean we can stop collecting data, while samples that do a lot may be a sign that they're not very good.

For some classifiers (e.g. SVM), this is intimately connected with the extent to which new training samples change the boundaries between classes (although see outlier detection). For other classifiers (e.g. Naive Bayes or DTW), we may also be interested in the extent to which a new sample changes the behavior of each individual class, independent of its effect on the separability of different classes. For instance, with Naive Bayes, we might measure the change in probability distribution using something like the Bhattacharyya distance, Hellinger distance, or Kullback-Leibler divergence.
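For a single feature modeled as a univariate Gaussian (as in Gaussian Naive Bayes), all three measures have closed forms. The sketch below compares a class's distribution before and after adding a sample. The function names and the example parameters are illustrative only.

```python
import math


def bhattacharyya(mu1, sd1, mu2, sd2):
    """Bhattacharyya distance between N(mu1, sd1^2) and N(mu2, sd2^2)."""
    v1, v2 = sd1 ** 2, sd2 ** 2
    return (0.25 * (mu1 - mu2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2 * sd1 * sd2)))


def hellinger(mu1, sd1, mu2, sd2):
    """Hellinger distance, via H^2 = 1 - exp(-Bhattacharyya distance)."""
    h2 = 1 - math.exp(-bhattacharyya(mu1, sd1, mu2, sd2))
    return math.sqrt(max(0.0, h2))  # guard against tiny negative rounding error


def kl_divergence(mu1, sd1, mu2, sd2):
    """KL(P || Q) with P = N(mu1, sd1^2) and Q = N(mu2, sd2^2); not symmetric."""
    return (math.log(sd2 / sd1)
            + (sd1 ** 2 + (mu1 - mu2) ** 2) / (2 * sd2 ** 2) - 0.5)


# Hypothetical class model before and after a new training sample shifts the mean.
before = (0.0, 1.0)
after = (1.0, 1.0)
print(bhattacharyya(*before, *after))  # 0.125
print(kl_divergence(*before, *after))  # 0.5
```

Hellinger distance is bounded in [0, 1], which may make it the easiest of the three to show to a user as a "how much did this sample change the class" score.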

Classes or individual samples are hard to distinguish / miscategorized

Need to show which classes are easily confused, and which samples within those classes cause problems. One way to think about this is the extent to which each sample intrudes on / is similar to samples of another class. This can be because it is a useful distinguishing example, or because it's low quality or simply mislabeled. There's not necessarily any way for the system to tell the difference between useful distinguishing examples and garbage data, so we want the user to be aware of these high information gain samples. See Guyon, Matic, Vapnik, "Discovering Informative Patterns and Data Cleaning" for a more thorough discussion.

This scoring of information gain may be different for each classifier. For instance, in an SVM, it could be the weights assigned to each sample. (Again, see Guyon, Matic, Vapnik.) For DTW, it might be a comparison of the average distance of each sample to other samples in its class vs. its distance to the samples in other classes.
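A sketch of the DTW variant, using a textbook dynamic time warping distance on 1-D sequences and a ratio of mean within-class to mean between-class distance. The `intrusion_score` name, the ratio form, and the toy gestures are assumptions. The last sample is deliberately labeled in a way that looks suspicious.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]


def intrusion_score(index, samples, labels):
    """Mean distance to own class divided by mean distance to other classes.

    Values near or above 1 suggest the sample sits close to another class.
    Assumes both groups are non-empty and the other-class mean is non-zero.
    """
    own, other = [], []
    for i, (sample, label) in enumerate(zip(samples, labels)):
        if i == index:
            continue
        d = dtw_distance(samples[index], sample)
        (own if label == labels[index] else other).append(d)
    return (sum(own) / len(own)) / (sum(other) / len(other))


samples = [[0, 1, 0], [0, 1, 1, 0], [0, 5, 0], [0, 5, 5, 0], [0, 4, 0]]
labels = ["small", "small", "big", "big", "small"]
typical = intrusion_score(0, samples, labels)
suspect = intrusion_score(4, samples, labels)
```

Here `typical` comes out well below 1 and `suspect` above 1, but the score alone cannot say whether the last sample is mislabeled or a legitimately large "small" gesture; that call stays with the user.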

One complication here is that the data may be a time-series, so we need to summarize predictions over time. Ding et al., "Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures", survey various techniques for distance measurements for time series data. Li et al., "Linking Temporal Records", discuss methods of comparing time-series data that take into account the typical continuity of the data. Ross et al. discuss an approach to summarizing confusion matrices in "Performance measures for summarizing confusion matrices: the AFRL COMPASE approach", but they aggregate over different conditions, not time.

False negative: an action was missed by the system

Potential responses:

increase null rejection thresholds (variability)
add missed sample to the training data
add other training data samples

False positive: an action was detected that shouldn't have been

Potential responses:

lower null rejection thresholds
deleting or trimming training samples that generated / contributed to the false positive

Useful information:

distance from the live data to the predicted class
some measure of the extent to which / in what way each sample in that class contributed
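To make the threshold responses concrete, here is a sketch of one common style of null rejection: accept a live sample only if its distance to the class centroid is within the mean training distance plus a coefficient times the standard deviation. This particular rule, the Euclidean distance, and all names are assumptions for illustration, not a description of any specific classifier.

```python
import math
import statistics


def centroid(points):
    """Per-dimension mean of the training points of one class."""
    return [statistics.fmean(dim) for dim in zip(*points)]


def rejection_threshold(points, coefficient):
    """Centroid and threshold = mean + coefficient * std of training distances."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    return c, statistics.fmean(dists) + coefficient * statistics.pstdev(dists)


def check(sample, c, threshold):
    """Distance from the live sample to the class, and whether it is accepted."""
    d = math.dist(sample, c)
    return d, d <= threshold


training = [(0, 0), (2, 0), (1, 1), (1, 3)]
live = (4, 1)

c, strict = rejection_threshold(training, coefficient=1.0)
_, loose = rejection_threshold(training, coefficient=3.0)
d, accepted_strict = check(live, c, strict)
_, accepted_loose = check(live, c, loose)
print(d)  # 3.0
```

With the stricter threshold the live sample is rejected (a false negative if it was a real gesture); raising the coefficient accepts it, at the cost of also accepting more things that are not the gesture.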

Scoring test data

Evaluation method may change based on the kind of classifier. For example, we may need to use a time-based algorithm (one that allows for small variations in the start and end points of a classification) along the lines of André Gensler, Bernhard Sick, "Novel Criteria to Measure Performance of Time Series Segmentation Techniques".
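A much simpler stand-in for that idea (not the criteria from the Gensler and Sick paper) is to treat predictions as labeled events and count a detection as correct when its label matches and its start and end fall within a tolerance of the expected ones. The event tuple format and function name are assumptions.

```python
def match_events(expected, detected, tolerance):
    """Greedily pair (label, start, end) events whose boundaries agree within tolerance.

    Returns (hits, misses, false_alarms).
    """
    unmatched = list(detected)
    hits = 0
    for label, start, end in expected:
        for cand in unmatched:
            c_label, c_start, c_end = cand
            if (c_label == label
                    and abs(c_start - start) <= tolerance
                    and abs(c_end - end) <= tolerance):
                unmatched.remove(cand)  # each detection can satisfy one expectation
                hits += 1
                break
    return hits, len(expected) - hits, len(unmatched)


expected = [("wave", 1.0, 2.0), ("tap", 3.0, 3.5)]
detected = [("wave", 1.1, 2.2), ("tap", 4.0, 4.5), ("wave", 6.0, 7.0)]
result = match_events(expected, detected, tolerance=0.25)
print(result)  # (1, 1, 2)
```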

Other Approaches

Tür, Gökhan, Mazin G. Rahim, and Dilek Z. Hakkani-Tür. "Active labeling for spoken language understanding." INTERSPEECH. 2003 suggest two approaches:

scoring training data using a classifier trained on previously-labeled (and verified) data
scoring training data using a classifier trained on the training data itself

Then, have the user re-label data which is given a different label by the classifier than was initially assigned by the user, or those which are given the same label but a low confidence by the classifier.
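The re-labeling rule can be sketched independently of which of the two classifiers produced the scores. The sketch assumes predictions arrive as `(label, confidence)` pairs; the function name and the 0.6 confidence cutoff are placeholders.

```python
def flag_for_review(user_labels, predictions, min_confidence=0.6):
    """Indices of samples the user should look at again, with the reason."""
    flagged = []
    for i, (user, (pred, conf)) in enumerate(zip(user_labels, predictions)):
        if pred != user:
            flagged.append((i, "disagrees"))
        elif conf < min_confidence:
            flagged.append((i, "low confidence"))
    return flagged


user_labels = ["a", "a", "b", "b"]
predictions = [("a", 0.9), ("b", 0.8), ("b", 0.4), ("b", 0.95)]
flags = flag_for_review(user_labels, predictions)
print(flags)  # [(1, 'disagrees'), (2, 'low confidence')]
```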

Our problem is different from traditional active labeling / active learning because our user is supplying additional / new training data, not simply re-labeling existing data. That makes the problem different (harder?) because it's not a matter of simply correcting labels but collecting data that covers the feature space in a good way.

Integration Into Project

How does this work? Do you integrate the machine learning pipeline into an Arduino program that's compiled and loaded onto your board? Is the logic specified on the Arduino, with prediction happening on the computer, with input streaming from and predictions streaming to the Arduino?

Or do you write all the logic on the computer, with just sensors input and actuator outputs going from/to an Arduino? Or just write everything on a Raspberry Pi? In either case, in what programming language / framework do you write your system code?

Injecting key presses: games may get too hard if you play with bodily motions (w/ recognizer that's not perfect). Also key repeat may matter in ways that are difficult to simulate. May want to look into Processing games that we can tweak to make appropriately challenging / playable.
