SKLL 2.0
This is a major new release. It's probably the largest SKLL release we have ever done since SKLL 1.0 came out! It includes dozens of new features, bugfixes, and documentation updates!
⚡️ SKLL 2.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️
💥 Incompatible Changes 💥
- Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue #497, PR #506).
- The configuration field `objective` has been deprecated and replaced with `objectives`, which allows specifying multiple tuning objectives for grid search (Issue #381, PR #458).
- Grid search is now enabled by default, both in the API and when using a configuration file (Issue #463, PR #465).
- The `Predictor` class previously provided by the `generate_predictions` utility script is no longer available. If you were relying on this class, you should just load the model file and call `Learner.predict()` instead (Issue #562, PR #566).
- There are no longer any default grid search objectives, since the choice of objective is best left to the user. Note that because grid search is enabled by default, you must either choose an objective or explicitly disable grid search; see the sketch after this list (Issue #381, PR #458).
- `mean_squared_error` is no longer supported as a metric. Use `neg_mean_squared_error` instead (Issue #382, PR #470).
- The `cv_folds_file` configuration file field is now just called `folds_file` (Issue #382, PR #470).
- Running an experiment with the `learning_curve` task now requires specifying `metrics` in the `Output` section instead of `objectives` in the `Tuning` section (Issue #382, PR #470).
- Previously, when reading in CSV/TSV files, missing data was automatically imputed as zeros, which is not appropriate in all cases. This is no longer done: blanks are now retained as is, and missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue #364, PRs #475 & #518).
- `pandas` and `seaborn` are now direct dependencies of SKLL rather than optional ones (Issues #455 & #364, PRs #475 & #508).
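For readers updating existing scripts, here is a minimal sketch of how the new grid search defaults play out in the API; the import paths, keyword names, and file names below are assumptions based on the SKLL 2.0 API rather than excerpts from these notes. It also shows the load-and-predict pattern that replaces the removed `Predictor` class.

```python
# A minimal sketch, assuming these SKLL 2.0 import paths and keyword names;
# file and feature names are purely illustrative.
from skll import Learner
from skll.data import Reader

# Read a training FeatureSet from a CSV file.
train_fs = Reader.for_path("train.csv", label_col="y").read()

learner = Learner("LogisticRegression")

# Grid search is now on by default and there is no default objective,
# so either name a tuning objective explicitly ...
learner.train(train_fs, grid_objective="f1_score_micro")

# ... or disable grid search explicitly.
learner.train(train_fs, grid_search=False)

# The Predictor class is gone: load the saved model file and call predict().
learner.save("logistic.model")
predictions = Learner.from_file("logistic.model").predict(train_fs)
```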
💡 New features 💡
- `CSVReader`/`CSVWriter` and `TSVReader`/`TSVWriter` now use `pandas` as the backend rather than custom code that relied on the `csv` module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks here (Issue #364, PRs #475 & #518).
- SKLL models now have a new `pipeline` attribute which makes it easy to manipulate and use them in `scikit-learn`, if needed; see the first sketch after this list (Issue #451, PR #474).
- The SKLL conda package is now a generic Python package, which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public ETS anaconda channel.
- SKLL learner hyperparameters have been updated to match the new `scikit-learn` defaults and those upcoming in 0.22.0 (Issue #438, PR #533).
- Intermediate results for the grid search process are now available in the `results.json` files (Issue #431, #471).
- The K models trained for each split of a K-fold cross-validation experiment can now be saved to disk (Issue #501, PR #505).
- Missing values in CSV/TSV files can be dropped or replaced, both via the command line and the API; see the second sketch after this list (Issue #540, PR #542).
- Warnings from `scikit-learn` are now captured in SKLL log files (Issue #441, PR #480).
- `Learner.model_params()` and, consequently, the `print_model_weights` utility script now work with models trained on hashed features (Issue #444, PR #466).
- The `print_model_weights` utility script can now output feature weights sorted by class labels to improve readability (Issue #442, PR #468).
- The `skll_convert` utility script can now convert feature files that do not contain labels (Issue #426, PR #453).
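To make the new `pipeline` attribute more concrete, here is a hedged sketch; the `pipeline=True` constructor keyword and the exact steps inside the resulting scikit-learn `Pipeline` are assumptions that should be checked against the SKLL 2.0 documentation.

```python
# A sketch of the new `pipeline` attribute (constructor keyword assumed).
from skll import Learner
from skll.data import Reader

train_fs = Reader.for_path("train.csv", label_col="y").read()

# Ask SKLL to populate the `pipeline` attribute when the model is trained.
learner = Learner("LogisticRegression", pipeline=True)
learner.train(train_fs, grid_search=False)

# `learner.pipeline` is a scikit-learn Pipeline (vectorizer, scaler, estimator, ...)
# that can be used outside SKLL, e.g. on raw feature dictionaries.
print(learner.pipeline.predict([{"feature1": 1.0, "feature2": 0.5}]))
```

Similarly, here is a sketch of the new missing-value handling at read time; the `drop_blanks` and `replace_blanks_with` parameter names are assumptions based on the feature description above, so consult the reader documentation for the exact API.

```python
# A sketch of dropping or replacing missing values at read time
# (parameter names are assumed; see the SKLL 2.0 reader documentation).
from skll.data import CSVReader

# Drop any row that contains a blank value ...
fs_dropped = CSVReader("train.csv", drop_blanks=True).read()

# ... or replace blanks with a fixed value instead.
fs_filled = CSVReader("train.csv", replace_blanks_with=0.0).read()
```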
🛠 Bugfixes & Improvements 🛠
- Fix several bugs in how various tuning objectives and output metrics were computed (Issues #545 & #548, PR #551).
- Fix how `pos_label_str` is documented, read in, and used for classification tasks (Issues #550 & #570, PRs #566 & #571).
- Fix several bugs in the `generate_predictions` utility script and streamline its implementation to not rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues #484 & #562, PR #566).
- Fix a bug caused by overlap between tuning objectives and output metrics that could prevent metric computation (Issue #564, PR #567).
- Using an externally specified `folds_file` for grid search now works for the `evaluate` and `predict` tasks, not just `train` (Issue #536, PR #538).
- Fix incorrect application of sampling before feature scaling in `Learner.predict()` (Issue #472, PR #474).
- Disable feature sampling for the `MultinomialNB` learner since it cannot handle negative values (Issue #473, PR #474).
- Add missing logger attribute to `Learner.FilteredLeaveOneGroupOut` (Issue #541, PR #543).
- Fix `FeatureSet.has_labels` to recognize a list of `None` objects, which is what happens when you read in an unlabeled data set and pass `label_col=None` (Issue #426, PR #453).
- Fix bug in `ARFFWriter` that added/removed `label_col` from the field names even if it was `None` to begin with (Issue #452, PR #453).
- Do not produce unnecessary warnings for learning curves (Issue #410, PR #458).
- Show a warning when applying feature hashing to multiple feature files (Issue #461, PR #479).
- Fix loading issue for saved `MultinomialNB` models (Issue #573, PR #574).
- Reduce memory usage for learning curve experiments by explicitly closing `matplotlib` figure instances after they are saved.
- Improve SKLL's cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the `newline` parameter when writing files.
📖 Documentation Updates 📖
- Reorganize the documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the `Output` section (Issue #459, PR #568).
- Add a new interactive tutorial that uses a Jupyter notebook hosted on Binder (Issue #448, PRs #547 & #552).
- Add a new page to the official documentation explaining how the SKLL code is organized for new developers (Issue #511, PR #519).
- Update the SKLL contribution guidelines and link to them from the official documentation (Issues #498 & #514, PRs #503 & #519).
- Update the documentation to indicate that `pandas` and `seaborn` are now direct dependencies and not optional (Issue #553, PR #563).
- Update the `LogisticRegression` learner documentation to talk explicitly about penalties and solvers (Issue #490, PR #500).
- Properly document the internal conversion of string labels to ints/floats and possible edge cases (Issue #436, PR #476).
- Add feature scaling to the Boston regression example (Issue #469, PR #478).
- Several other additions and updates to the documentation (Issue #459, PR #568).
✔️ Tests ✔️
- Make `tests` into a package so that we can do things like `from skll.tests.utils import X`, etc. (Issue #530, PR #531).
- Add new tests based on the SKLL examples so that we will know if the examples ever break with any SKLL updates (Issues #529 & #544, PR #546).
- Tweak tests to make the test suite runnable on Windows (and pass!).
- Add Azure Pipelines integration for automated test builds on Windows.
- Add several new comprehensive tests for all new features and bugfixes, and remove older, unnecessary tests. See the various PRs above for details.
- Current code coverage for SKLL tests is at 95%, the highest it has ever been!
🔍 Other changes 🔍
- Replace `prettytable` with the more actively maintained `tabulate` (Issue #356, PR #467).
- Make sure the entire codebase complies with PEP8 (Issue #460, PR #568).
- Update the year to 2019 everywhere (Issue #447, PRs #456 & #568).
- Update the TravisCI configuration to use `conda_requirements.txt` for building the environment (PR #515).
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Supreeth Baliga (@SupreethBaliga), Jeremy Biggs (@jbiggsets), Aoife Cahill (@aoifecahill), Ananya Ganesh (@ananyaganesh), R. Gokul (@rgokul), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Robert Pugh (@Lguyogiro), Maxwell Schwartz (@maxwell-schwartz), Eugene Tsuprun (@etsuprun), Avijit Vajpayee (@AVajpayeeJr), Mengxuan Zhao (@chaomenghsuan)