Re-implementing the 1R experiments described in Holte, 1993.
The algorithm generates a rule for each attribute, and then picks the "one rule" that has the best accuracy.
For example, given a data set of drinking habits with attributes such as age, time of day, and mood, 1R might produce a rule of the form:
if time="morning" then drink="coffee"
if time="afternoon" then drink="tea"
if time="evening" then drink="water"
The rule might achieve only, say, 60% accuracy, but that gives a baseline against which to compare other algorithms.
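In sketch form, the core of 1R is a nested frequency count. The following is a minimal illustration (not the code in this repository): for each attribute, predict the majority class per attribute value, then keep the attribute whose rule classifies the most training rows correctly.

```rust
use std::collections::HashMap;

/// Minimal 1R sketch (illustration only): for each attribute, map each
/// attribute value to its most frequent class, then keep the attribute
/// whose rule classifies the most training rows correctly.
fn one_r(rows: &[Vec<String>], class_col: usize) -> (usize, HashMap<String, String>) {
    let n_cols = rows[0].len();
    let mut best: Option<(usize, HashMap<String, String>, usize)> = None;

    for attr in (0..n_cols).filter(|&c| c != class_col) {
        // Count class frequencies for each value of this attribute.
        let mut counts: HashMap<&str, HashMap<&str, usize>> = HashMap::new();
        for row in rows {
            *counts
                .entry(&row[attr])
                .or_default()
                .entry(&row[class_col])
                .or_default() += 1;
        }

        // Predict the majority class for each value; the sum of the
        // majority counts is the number of correctly classified rows.
        let mut rule = HashMap::new();
        let mut correct = 0;
        for (value, class_counts) in &counts {
            let (class, count) = class_counts.iter().max_by_key(|(_, &n)| n).unwrap();
            rule.insert(value.to_string(), class.to_string());
            correct += count;
        }

        if best.as_ref().map_or(true, |(_, _, c)| correct > *c) {
            best = Some((attr, rule, correct));
        }
    }

    let (attr, rule, _) = best.expect("need at least one non-class attribute");
    (attr, rule)
}
```

The rule for the winning attribute is exactly the kind of IF/THEN table shown in the examples above.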
New to Rust? 👋 Start by installing rustup to get various tools, including the cargo command. Then...
$ cargo build --quiet --release
$ ./target/release/oner -d data/fake-house/house.csv -w
Config { data: "data/fake-house/house.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.70
IF size IS small THEN low
IF size IS big THEN high
IF size IS medium THEN medium
This application assumes attributes (features) are the columns and rows are the instances (examples).
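For example, a made-up file in the spirit of the fake-house example above might look like this, with the class (the value to predict) in the last column:

```csv
size,location,price
small,rural,low
medium,suburb,medium
big,city,high
```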
I have taken the data sets and converted them to CSV where necessary, including adding header rows.
The data folder contains the data from various sources. Unless otherwise specified, the source is the UCI Machine Learning Repository.
A breast cancer dataset.
In the CSV version I have moved the class from the first column to the last column (that's what this code expects).
I did this with:
$ awk -F, '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$1}' OFS=, < breast-cancer.data > bc.csv
Holte's experiments (section 2.2 of Holte 1993) used a random 2/3 of the data for training and 1/3 for testing, repeated 25 times. The experiment resulted in a 0.687 classification accuracy on the test set (Holte, table 3) against a baseline (0R) accuracy of 0.703 (table 2).
Model | Accuracy % |
---|---|
0R | 70.3 |
1R | 68.7 |
This code (mean of 10 seeds) | 68.4 |
This code (median of 10 seeds) | 68.3 |
This code (range of 10 seeds) | 67.6 - 69.6 |
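The evaluation loop behind numbers like these can be sketched as follows. This assumes the rand and rand_chacha crates, and train_and_score is a hypothetical stand-in for building a 1R rule on the training rows and measuring its accuracy on the test rows; the actual CLI's seeding and splitting may differ in detail.

```rust
use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

/// Sketch of the Holte-style protocol: shuffle, split 2/3 train / 1/3
/// test, score, and average over `repeats` runs. `train_and_score` is a
/// hypothetical stand-in for building a 1R rule on the training rows
/// and measuring its accuracy on the test rows.
fn mean_test_accuracy<T: Clone>(
    rows: &[T],
    repeats: usize,
    seed: u64,
    train_and_score: impl Fn(&[T], &[T]) -> f64,
) -> f64 {
    let mut rng = ChaCha8Rng::seed_from_u64(seed);
    let split = rows.len() * 2 / 3;
    let mut total = 0.0;
    for _ in 0..repeats {
        let mut shuffled = rows.to_vec();
        shuffled.shuffle(&mut rng);
        let (train, test) = shuffled.split_at(split);
        total += train_and_score(train, test);
    }
    total / repeats as f64
}
```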
The Chess (King-Rook vs. King-Pawn) dataset.
Model | Accuracy % |
---|---|
0R | 52.5 |
1R | 67.6 |
This code (mean of 10 seeds) | 67.6 |
This code (median of 10 seeds) | 67.6 |
This code (range of 10 seeds) | 67.2 - 67.8 |
The Iris dataset. The CSV version was created with:
$ echo "SepalLengthInCm,SepalWidthInCm,PetalLengthInCm,PetalWidthInCm,Class" > iris.csv
$ cat iris.data >> iris.csv
Model | Accuracy % |
---|---|
0R | 33.3 |
1R | 93.5 |
This code (mean of 10 seeds) | 95.1 |
This code (median of 10 seeds) | 95.0 |
This code (range of 10 seeds) | 94.5 - 95.9 |
Using the whole data set:
$ ./target/release/oner -d data/ir/iris.csv -w
Config { data: "data/ir/iris.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.960
IF PetalWidthInCm IS < 1 THEN Iris-setosa
IF PetalWidthInCm IS >= 1 and < 1.7 THEN Iris-versicolor
IF PetalWidthInCm IS >= 1.7 THEN Iris-virginica
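Interval rules like these come from quantizing numeric attributes such as PetalWidthInCm. A loose sketch of the idea behind the --small threshold (see the options below) follows; it is simplified, and a faithful version of Holte's method would also handle details such as merging adjacent intervals that share a dominant class.

```rust
use std::collections::HashMap;

/// Loose sketch of interval discretization for a numeric attribute:
/// sort (value, class) pairs and grow an interval until some class
/// occurs more than `small` times in it, then cut between distinct
/// values. Simplified relative to a full implementation.
fn split_points(mut pairs: Vec<(f64, String)>, small: usize) -> Vec<f64> {
    pairs.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

    let mut cuts = Vec::new();
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for window in pairs.windows(2) {
        let (value, next) = (window[0].0, window[1].0);
        *counts.entry(window[0].1.as_str()).or_insert(0) += 1;
        // Close the interval once a class is "big enough" and the next
        // value is distinct, so the cut point is well defined.
        if counts.values().any(|&n| n > small) && next > value {
            cuts.push((value + next) / 2.0);
            counts.clear();
        }
    }
    cuts
}
```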
The dataset used to introduce 1R in Interpretable Machine Learning (published under CC BY-NC-SA 4.0). To run the example, use the -w flag to use the whole dataset for rule discovery.
oner 0.2.0
USAGE:
oner [FLAGS] [OPTIONS] --data <filename>
FLAGS:
--help Prints help information
-h, --hide-rules Suppress printing of rules at end of run
-w, --use-whole-dataset Use all the data for training and testing (overrides -t)
-V, --version Prints version information
OPTIONS:
--distinct-above <distinct-above>
An attribute must have more than this number of distinct values for the column to be detected as numeric
(and so quantized) [default: 6]
-d, --data <filename> Complete data set to learn from (in CSV format, with header row)
-m, --missing <missing>
When quantizing, a value to treat as a missing value (in addition to blank attribute values) [default: ?]
-r, --repeats <repeats>
Number of times to repeat an experiment to report average accuracy [default: 25]
-s, --seed <seed> Random seed [default: 1]
--small <small>
When quantizing, the dominant class of an interval must occur more than this many times [default: 6]
-t, --training-fraction <training-fraction>
Fraction of the data to use for training (vs. testing) [default: 0.6666666666666666]
Copyright 2020 Richard Dallaway
This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.