Authorship attribution of Frankenstein
Hypothesis: Frankenstein contains small chunks of text (< 100 consecutive words) written by Percy Bysshe Shelley (PBS).
Results: The percentage of samples classified as PBS at different sample sizes supports the hypothesis when the SVM (Support Vector Machine) classifier is trained on part-of-speech bigram frequencies. As the sample size gets smaller, the percentage of PBS classifications goes up before it goes down (see tags in figure below). This is due to the way the sampling process interacts with the chunks of PBS-authored text in Frankenstein (see figure above).
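To make the feature concrete, here is a minimal sketch of how a text could be cut into fixed-size samples and each sample turned into part-of-speech bigram frequencies. It is an illustration, not the project's actual pipeline: the function names `split_into_samples` and `pos_bigram_frequencies` are hypothetical, the samples are assumed to be consecutive and non-overlapping, the input is assumed to be already POS-tagged (one tag per word), and normalizing by the number of bigrams in the sample is an assumption.

```python
from collections import Counter


def split_into_samples(items, sample_size):
    """Cut a sequence into consecutive, non-overlapping samples.

    Assumption: a trailing remainder shorter than sample_size is dropped.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    n_full = len(items) // sample_size
    return [items[i * sample_size:(i + 1) * sample_size] for i in range(n_full)]


def pos_bigram_frequencies(tags):
    """Relative frequency of each adjacent POS-tag pair within one sample."""
    bigrams = list(zip(tags, tags[1:]))
    if not bigrams:
        return {}
    counts = Counter(bigrams)
    total = len(bigrams)
    return {bigram: count / total for bigram, count in counts.items()}


# Hypothetical pre-tagged input (one POS tag per word), for illustration only.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "DET", "ADJ"]
for sample in split_into_samples(tags, sample_size=4):
    print(pos_bigram_frequencies(sample))
```

Each resulting dictionary would still need to be mapped onto a fixed-length vector (one position per bigram seen in training) before it can be passed to an SVM.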
When the classifier is trained on function word frequencies, the proportion of PBS-classified samples at different sample sizes can instead be explained by the monotonically increasing relationship between accuracy (or F-score) and sample size (see figure below). Because many more samples were written by Mary Wollstonecraft Shelley (MWS) than by PBS, at lower accuracies more samples actually written by MWS are misclassified as PBS than the other way around. This results in the monotonically decreasing curve for words in the figure above.
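The class-imbalance argument can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the classifier has the same accuracy on both authors and that every error flips the label to the other author. The function name `predicted_pbs_share` and the 90/10 split are hypothetical and are not taken from the experiment.

```python
def predicted_pbs_share(n_mws, n_pbs, accuracy):
    """Share of all samples labelled PBS, given a per-class accuracy.

    Assumption: the same accuracy applies to both authors, and every
    misclassified sample is assigned to the other author.
    """
    correct_pbs = n_pbs * accuracy          # PBS samples labelled PBS
    mws_as_pbs = n_mws * (1.0 - accuracy)   # MWS samples mislabelled as PBS
    return (correct_pbs + mws_as_pbs) / (n_mws + n_pbs)


# Hypothetical imbalance: 90 MWS samples, 10 PBS samples.
for accuracy in (1.0, 0.9, 0.8, 0.7):
    share = predicted_pbs_share(n_mws=90, n_pbs=10, accuracy=accuracy)
    print(f"accuracy={accuracy:.1f}  PBS-classified share={share:.2f}")
```

With these made-up numbers the PBS-classified share rises from 0.10 at perfect accuracy to 0.34 at 70% accuracy. Since accuracy falls as samples get smaller, the share of PBS classifications rises as sample size shrinks, even though the true share of PBS samples is unchanged.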