Releases: EducationalTestingService/skll
SKLL 5.0.1
🛠 Minor Changes 🛠
SKLL v5.0.1 is a minor release with no changes for users.
- Updated pre-commit checks.
- Updated dependencies.
  - Removed all dev dependencies from requirements.txt.
  - Updated versions in doc/requirements.txt.
- Added new requirements.dev file. This file contains the runtime as well as dev dependencies.
  - Updated CONTRIBUTING.md to use this file instead of requirements.txt.
  - Excluded this file in MANIFEST.in so that it's not part of the PyPI package.
  - Updated CI pipelines to use requirements.dev instead of requirements.txt.
- Updated release process checklist.
Full Changelog: v5.0.0...v5.0.1
SKLL 5.0.0
💥 Breaking changes 💥
- scikit-learn has been updated to v1.4.0. This means that the SKLL experiments will likely yield different results compared to SKLL v4.0.1 (#766).
- Python 3.8 and 3.9 are no longer supported since scikit-learn v1.4.0 doesn't support them.
- Compared to previous versions, additional information is included in the results.json output files produced when running experiments (#761).
💡 New features 💡
- SKLL results can now be automatically logged to Weights & Biases (#758, #761, #765)
- Python 3.12 is now supported.
🛠 Bugfixes & Improvements 🛠
- Fix ReadTheDocs config (#757)
Full Changelog: v4.0.1...v5.0.0
SKLL 4.0.1
SKLL 4.0.0
💥 Breaking changes 💥
- scikit-learn has been updated to v1.3.0. This means that the same SKLL experiments may yield different results than they did with SKLL 3.2.0.
💡 New features 💡
- Add BaggingClassifier and BaggingRegressor support by @desilinguist in #742
- Add support for HistGradientBoostingClassifier and HistGradientBoostingRegressor by @desilinguist in #743
- Include model fit times in learning curves by @desilinguist in #745
- Add neg_root_mean_squared_error metric and objective for regressors by @desilinguist in #741
- Add support for Python 3.11 by @desilinguist in #749
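The neg_root_mean_squared_error metric negates the error so that larger values are better, which is what a tuning objective needs. The sketch below only illustrates the quantity in plain Python; the function is a stand-in written for this example and is not SKLL's or scikit-learn's implementation.

```python
import math


def neg_root_mean_squared_error(y_true, y_pred):
    """Return the negated RMSE so that higher scores mean better fits."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -math.sqrt(sum(squared_errors) / len(squared_errors))


# A perfect prediction scores 0; every prediction off by one scores -1.
perfect = neg_root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
worse = neg_root_mean_squared_error([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Because the value is negated, a maximizing grid search prefers the model with the smaller error.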
🛠 Bugfixes & Improvements 🛠
- Apply code formatting and other minor changes. by @desilinguist in #724
- Use pathlib.Path where possible. by @desilinguist in #725
- Migrate to new codecov uploader. by @desilinguist in #728
- Add type hints to skll.config module by @desilinguist in #729
- Add type hints to skll.data module & improve types in skll.config by @desilinguist in #730
- Bug fix in feature set split method by @tamarl08 in #731
- Add type hints to skll.experiments module by @desilinguist in #732
- Add type hints to skll.learner module + other refactoring by @desilinguist in #734
- Add type hints in skll.utils module and in all other remaining files. by @desilinguist in #736
- Improve docstrings and create linkable type hints (Part 1) by @desilinguist in #737
- Improve docstrings and type hints (Part 2). by @desilinguist in #738
- Improve docstrings and type hints (Part 3) by @desilinguist in #739
- Improve docstrings & type hints (Part 4) by @desilinguist in #740
- Migrate tests to nose2 instead of nose by @desilinguist in #747
- Stop using sklearn's private _scorer API for custom metrics in SKLL. by @desilinguist in #751
- Fix a few typos, etc. in the documentation by @mulhod in #712
🙏🏽 Code reviewers 🙏🏽
In no particular order: @dblandan, @mulhod, @Frost45, @tamarl08, @damien2012eng
New Contributors
Full Changelog: v3.2.0...v4.0.0
SKLL 3.2.0
What's Changed
- Update RTD requirements to fix failing build. by @desilinguist in #719
- Update dependencies, consolidate requirements, and tweak coverage by @desilinguist in #721
- Release v3.2.0 by @desilinguist in #722
Full Changelog: v3.1.0...v3.2.0
SKLL 3.1.0
This is a new release with dependency updates, bugfixes, and improvements.
💥 Dependency Updates 💥
- scikit-learn has been updated to v1.1.2. This could mean that the same SKLL experiments, when run with SKLL 3.1.0, yield different results than with earlier versions (Issue #713, PR #716).
🛠 Bugfixes & Improvements 🛠
- SKLL Learners now support a new method get_feature_names_out() which returns the correct set of features actually used by the learner. Since some features might be removed by the feature selector, relying on the vectorizer vocabulary is not enough in those cases. This method allows easy access to the names of the actual features used, even if the selector has removed some features (Issue #714, PR #715).
- Updated learning curve code to use the new API for seaborn v0.12.0 (PR #716).
- Removed the Boston housing dataset from SKLL examples and tests. This dataset has ethical issues and is being removed from scikit-learn. (Issue #700, #717)
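To see why the vectorizer vocabulary alone is not enough, consider a vocabulary that maps feature names to column indices and a selector that keeps only some columns. The sketch below uses plain Python data structures; the function name, the vocabulary, and the mask are all hypothetical and do not reflect SKLL internals.

```python
def selected_feature_names(vocabulary, support_mask):
    """Return the feature names that survive selection, in column order.

    vocabulary: dict mapping feature name -> column index (vectorizer-style)
    support_mask: list of booleans, True where the selector kept the column
    """
    names_by_column = sorted(vocabulary, key=vocabulary.get)
    return [name for name, kept in zip(names_by_column, support_mask) if kept]


# The vocabulary still lists three features, but the selector dropped one.
vocabulary = {"length": 0, "num_typos": 1, "always_zero": 2}
support_mask = [True, True, False]
used_features = selected_feature_names(vocabulary, support_mask)
```

With a trained SKLL learner, the release notes indicate that calling get_feature_names_out() on the learner gives this filtered list directly.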
✔️ Tests ✔️
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Sanjna Kashyap (@Frost45), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), and Remo Nitschke (@remo-help).
SKLL 3.0.0
This is a major new release with dependency updates and bugfixes!
⚡️ SKLL 3.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️
💥 Breaking Changes 💥
- Python 3.7 is no longer officially supported while official support for Python 3.10 has been added (Issue #701, PR #711).
- scikit-learn has been updated to v1.0.1 (Issue #699, PR #702).
- The configuration field pos_label_str from the “Tuning” section has been renamed to pos_label. Older configuration files with pos_label_str will now raise an exception (Issue #569, PR #706).
- The configuration field log from the “Output” section, which was renamed to logs in SKLL v2.5, has now been removed entirely. Older configuration files with log will now raise an exception (Issue #671, PR #705).
💡 New features 💡
- SKLL now supports specifying custom seed values for cross-validation tasks. This option may be useful for running the same cross-validation experiment multiple times (with the same number of differently constituted folds) to get a sense of the variance across replicates (Issue #593, PR #707).
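The idea behind custom cross-validation seeds can be shown without SKLL: the same data and the same number of folds, shuffled with different seeds, give differently constituted folds. This is a minimal sketch; make_folds is a hypothetical helper, and SKLL itself relies on scikit-learn's splitters rather than this code.

```python
import random


def make_folds(example_ids, num_folds, seed):
    """Shuffle the ids with the given seed and deal them into num_folds folds."""
    shuffled = list(example_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_folds] for i in range(num_folds)]


ids = list(range(20))
folds_a = make_folds(ids, num_folds=5, seed=1)
folds_b = make_folds(ids, num_folds=5, seed=2)
folds_a_again = make_folds(ids, num_folds=5, seed=1)
```

Repeating an experiment over several seeds and comparing the scores gives a rough sense of the variance across replicates, while reusing a seed reproduces the same folds.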
🛠 Bugfixes & Improvements 🛠
- Using the --drop-blanks option with filter_features now raises a more useful error for the case when every single row in a tabular feature file has a blank column (Issue #693, PR #703).
- SKLL conda packages are again generic Python packages instead of platform-specific ones (Issue #710, PR #711).
📖 Documentation Updates 📖
- Add a new section to the hands-on tutorial explaining how to first install SKLL in a virtual environment (Issue #689, PR #709).
- Add missing link to SKLL repository in the tutorial data section (Issue #688, PR #691).
- Update CONTRIBUTING.md to include more detailed instructions for pushing to the SKLL repository (Issue #680, PR #704).
- Link to the RSMTool implementation of quadratic_weighted_kappa which supports continuous values and can be used as a custom metric in SKLL for both hyper-parameter tuning as well as validation. See the quadratic_weighted_kappa bullet under the objectives section (Issue #512, PR #704).
- Continued readability improvements to function and method docstrings.
✔️ Tests ✔️
- All tests now specify local=True when making run_configuration() calls. This ensures that tests always run in local mode and prevents an unnecessary check for the gridmap library (Issue #616, PR #708).
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Binod Gyawali (@bndgyawali), Robbie Imbrie (@RobertImbrie), Sanjna Kashyap (@Frost45), Sözen Ozkan Grigoras (@sozkangrigoras), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), and Damien Xie (@damien2012eng).
SKLL 2.5
This is a major new release with dozens of new features, bugfixes, and documentation updates!
⚡️ SKLL 2.5 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️
💥 Breaking Changes 💥
- Python 3.6 is no longer officially supported since the latest versions of pandas and numpy have dropped support for it.
- Older top-level imports have been removed and should now be rewritten as follows (Issue #661, PR #662):
  - from skll import Learner ➡️ from skll.learner import Learner
  - from skll import FeatureSet ➡️ from skll.data import FeatureSet
  - from skll import run_configuration ➡️ from skll.experiments import run_configuration
- The default value for the class_labels keyword argument for Learner.predict() is now True instead of False. Therefore, for probabilistic classifiers, this method will now return class labels by default instead of class probabilities. To obtain class probabilities, set class_labels to False when calling this method (Issue #621, PR #622).
- The filter_features script now offers more intuitive command line options. Input files must be specified using -i/--input and output files must be specified using -o/--output. Additionally, --inverse must now be used to invert the filtering command since -i is used for input files (Issue #598, PR #660).
- The MegaMReader and MegaMWriter classes have been removed from SKLL since .megam files are no longer supported by SKLL (Issue #532, PR #557).
- The param_grids option in the configuration file is now a list of dictionaries instead of a list of lists of dictionaries, one for each learner specified in the learners option. Correspondingly, the param_grid option in Learner.train() and Learner.cross_validate() is now a dictionary instead of a list of dictionaries, and the default parameter grids for each learner are also simply dictionaries (Issue #618, PR #619).
- Running a learning_curve task via a configuration file now requires at least 500 examples. Fewer examples will raise a ValueError. This behavior can only be overridden when using Learner.learning_curve() directly via the API (Issue #624, PR #631).
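The import changes listed under the breaking changes above are mechanical, so they can be applied with a small script. The sketch below is a hypothetical migration helper written for illustration; it only handles the three exact import lines named in these notes and deliberately leaves combined imports such as "from skll import Learner, FeatureSet" untouched.

```python
# Mapping copied from the import changes listed in the release notes.
IMPORT_REWRITES = {
    "from skll import Learner": "from skll.learner import Learner",
    "from skll import FeatureSet": "from skll.data import FeatureSet",
    "from skll import run_configuration": "from skll.experiments import run_configuration",
}


def rewrite_imports(source):
    """Replace old top-level SKLL imports, leaving all other lines untouched."""
    rewritten = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped in IMPORT_REWRITES:
            # Preserve the original indentation (e.g. imports inside functions).
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + IMPORT_REWRITES[stripped]
        rewritten.append(line)
    return "\n".join(rewritten)


old_code = "import os\nfrom skll import Learner\nfrom skll import FeatureSet"
new_code = rewrite_imports(old_code)
```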
💡 New features 💡
- VotingClassifier and VotingRegressor from scikit-learn are now available for use in SKLL. This was done by adding a new VotingLearner class that uses Learner instances to represent underlying estimators (Issue #488, PR #665).
- SKLL now supports custom, user-defined metrics for both hyperparameter tuning as well as evaluation (Issue #606, PR #612).
- The following new built-in classification metrics are now available in SKLL: f05, f05_score_macro, f05_score_micro, f05_score_weighted, jaccard, jaccard_macro, jaccard_micro, jaccard_weighted, precision_macro, precision_micro, precision_weighted, recall_macro, recall_micro, and recall_weighted (Issues #609 and #610, PRs #607 and #612).
- scikit-learn has been updated to 0.24.1 (Issue #653, PR #659).
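The f05 family of metrics is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it for a single positive label in plain Python; it is an illustration of the formula only, and it does not show how SKLL or scikit-learn compute the macro, micro, or weighted variants.

```python
def fbeta_score(y_true, y_pred, positive_label, beta=0.5):
    """Compute F-beta for one positive label from paired label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# One true positive, no false positives, two false negatives:
# precision is 1.0 and recall is 1/3.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0]
f05 = fbeta_score(y_true, y_pred, positive_label=1, beta=0.5)
f1 = fbeta_score(y_true, y_pred, positive_label=1, beta=1.0)
```

For this precise but low-recall classifier, F0.5 (5/7) is higher than F1 (1/2), which is the behavior one would pick f05 for.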
🛠 Bugfixes & Improvements 🛠
- Hyperparameter tuning now uses 5-fold cross-validation, instead of 3, to match the change in the default value of the cv parameter for GridSearchCV. This will marginally increase the time taken for experiments with grid search but should produce more reliable results (Issue #487, PR #667).
- The SKLL codebase now uses sub-packages instead of very long modules which makes it easier to navigate and understand (Issue #600, PR #601).
- The log configuration file option has been renamed to logs. Using log will still work but will raise a warning. The log option will be removed entirely in the next release (Issue #520, PR #670).
- Learning curves are now correctly generated for probabilistic classifiers (Issue #648, PR #649).
- Saving models in the current directory via Learner.save() no longer requires adding ./ to the path (Issue #572, PR #604).
- The filter_features script no longer automatically assumes labels specified with -L or --label to be strings (Issue #598, PR #660).
- Remove the create_label_dict keyword argument from Learner.train() since it did not need to be user-facing (Issue #565, PR #605).
- Do not return 0 from correlation metrics when NaN is more appropriate. Returning 0 resulted in incorrect hyperparameter tuning results (Issue #585, PR #588).
- The Learner._check_input_formatting() private method now works correctly for dense featuresets (Issue #656, PR #658).
- SKLL conda packages are again platform-specific and the recipe now uses a conda_build_config.yaml to build the Python 3.7, 3.8, and 3.9 variants in one go (Issue #623, PR #XXX).
- Several useful changes to the SKLL code style:
  - Standardize string concatenation (Issue #636, PR #645)
  - Use the with context manager when opening files (Issue #641, PR #644)
  - Use f-strings where possible (Issue #633, PR #634)
  - Follow standard guidelines for sorting imports (Issue #638, PR #650)
  - Use pre-commit hooks to enforce code formatting guidelines during development (Issue #646, PR #650)
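The correlation fix above is easier to follow with a concrete case: when one of the inputs is constant, the Pearson correlation is undefined, and reporting 0 would make it look like a measured "no correlation" score. The sketch below is a plain-Python illustration written for this example, not SKLL's metric code.

```python
import math


def pearson(x, y):
    """Pearson correlation that returns NaN when either input has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Undefined: signal it explicitly instead of pretending the score is 0.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


constant_predictions = pearson([1, 2, 3, 4], [2, 2, 2, 2])
perfect_predictions = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

A tuning loop can then skip or penalize NaN scores deliberately rather than ranking them alongside genuine zero correlations.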
📖 Documentation Updates 📖
- Update CONTRIBUTING.md with the new sub-package structure of the SKLL codebase (Issue #611, PR #628).
- Add a section to the README that explains how to cite SKLL (Issue #599, PR #672).
- Add Azure Pipelines badge to the README (Issue #608, PR #672).
- Add explicit .readthedocs.yml file to configure the auto-built documentation (Issue #668, PR #672).
- Make it clear that not specifying the predictions configuration file option leads to prediction files being output in the current directory (Issue #664, PR #672).
✔️ Tests ✔️
- The Linux and Windows CI builds now use Python 3.7 and 3.8 respectively, instead of Python 3.6 (Issue #524, PR #665).
- Both the Linux and Windows CI builds now use consistent nosetests commands (Issue #584, PR #665).
- nose-cov is now automatically installed via conda_requirements.txt when setting up a development environment instead of requiring a separate step (Issue #527, PR #672).
- Add comprehensive new tests for voting learners, custom metrics, new built-in metrics, as well as for new bugfixes.
- Current code coverage for SKLL tests is at 97%, the highest it has ever been!
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Sree Harsha Ramesh (@srhrshr)
SKLL 2.1
This is a minor release of SKLL with the only change being that it is now compatible with scikit-learn v0.22.2.
⚡️ There are several changes in scikit-learn v0.22 that might cause several estimators and functions to produce different results even when fit with the same data and parameters. Therefore, SKLL 2.1 can also yield different results compared to previous versions even with the same data and same settings. ⚡️
💡 New features 💡
🔎 Other minor changes 🔎
- Update imports to align with the new scikit-learn API.
- A minor bugfix in logutils.py.
- Update some test outputs due to changes in scikit-learn models and functions.
- Update some tests to make pre-release testing for conda and PyPI packages possible.
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Matt Mulholland (@mulhod), Nitin Madnani (@desilinguist), and Mengxuan Zhao (@chaomenghsuan).
SKLL 2.0
This is a major new release. It's probably the largest SKLL release we have ever done since SKLL 1.0 came out! It includes dozens of new features, bugfixes, and documentation updates!
⚡️ SKLL 2.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️
💥 Incompatible Changes 💥
- Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue #497, PR #506).
- Configuration field objective has been deprecated and replaced with objectives, which allows specifying multiple tuning objectives for grid search (Issue #381, PR #458).
- Grid search is now enabled by default in both the API as well as while using a configuration file (Issue #463, PR #465).
- The Predictor class previously provided by the generate_predictions utility script is no longer available. If you were relying on this class, you should just load the model file and call Learner.predict() instead (Issue #562, PR #566).
- There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue #381, PR #458).
- mean_squared_error is no longer supported as a metric. Use neg_mean_squared_error instead (Issue #382, PR #470).
- The cv_folds_file configuration file field is now just called folds_file (Issue #382, PR #470).
- Running an experiment with the learning_curve task now requires specifying metrics in the Output section instead of objectives in the Tuning section (Issue #382, PR #470).
- Previously when reading in CSV/TSV files, missing data was automatically imputed as zeros. This is not appropriate in all cases. This is no longer the case and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue #364, PRs #475 & #518).
- pandas and seaborn are now direct dependencies of SKLL, and not optional (Issues #455 & #364, PRs #475 & #508).
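Since grid search is on by default and there are no default objectives, a configuration has to either name its objectives or turn grid search off. The sketch below mimics that rule with the standard library. The Tuning section and the objectives field come from these notes, while the grid_search field name and the two metric names are assumptions made for this example, and read_objectives is a hypothetical helper rather than SKLL's own configuration parser.

```python
import ast
import configparser

# Hypothetical configuration fragment; field and metric names are assumptions.
CONFIG_TEXT = """
[Tuning]
grid_search = true
objectives = ['f1_score_micro', 'accuracy']
"""


def read_objectives(config_text):
    """Return (grid_search, objectives), enforcing the 'no default objective' rule."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    tuning = parser["Tuning"]
    # Grid search defaults to enabled when the field is absent.
    grid_search = tuning.getboolean("grid_search", fallback=True)
    # The objectives value is written as a Python-style list literal.
    objectives = ast.literal_eval(tuning.get("objectives", fallback="[]"))
    if grid_search and not objectives:
        raise ValueError("grid search is enabled but no objectives were given")
    return grid_search, objectives


grid_search, objectives = read_objectives(CONFIG_TEXT)
```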
💡 New features 💡
- CSVReader/CSVWriter & TSVReader/TSVWriter now use pandas as the backend rather than custom code that relied on the csv module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks here (Issue #364, PRs #475 & #518).
- SKLL models now have a new pipeline attribute which makes it easy to manipulate and use them in scikit-learn, if needed (Issue #451, PR #474).
- The SKLL conda package is now a generic Python package which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public ETS anaconda channel.
- SKLL learner hyperparameters have been updated to match the new scikit-learn defaults and those upcoming in 0.22.0 (Issue #438, PR #533).
- Intermediate results for the grid search process are now available in the results.json files (Issue #431, #471).
- The K models trained for each split of a K-fold cross-validation experiment can now be saved to disk (Issue #501, PR #505).
- Missing values in CSV/TSV files can be dropped/replaced both via the command line and the API (Issue #540, PR #542).
- Warnings from scikit-learn are now captured in SKLL log files (Issue #441, PR #480).
- Learner.model_params() and, consequently, the print_model_weights utility script now work with models trained on hashed features (Issue #444, PR #466).
- The print_model_weights utility script can now output feature weights sorted by class labels to improve readability (Issue #442, PR #468).
- The skll_convert utility script can now convert feature files that do not contain labels (Issue #426, PR #453).
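Because blanks in CSV/TSV files are no longer silently turned into zeros, they have to be dropped or replaced explicitly. The sketch below shows the two choices with the standard csv module; handle_blanks is a hypothetical helper for illustration and does not reproduce SKLL's command line options or API.

```python
import csv
import io


def handle_blanks(csv_text, drop=False, replacement=None):
    """Drop rows containing blank cells, or fill blanks with a replacement value."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        has_blank = any(value == "" for value in row.values())
        if has_blank and drop:
            continue
        if has_blank and replacement is not None:
            row = {key: (replacement if value == "" else value) for key, value in row.items()}
        rows.append(dict(row))
    return rows


data = "id,f1,f2\na,1,2\nb,,4\n"
dropped = handle_blanks(data, drop=True)
filled = handle_blanks(data, replacement="0")
untouched = handle_blanks(data)
```

Leaving both options off keeps the blank as an empty string, which mirrors the new "blanks are retained as is" behavior described above.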
🛠 Bugfixes & Improvements 🛠
- Fix several bugs in how various tuning objectives and output metrics were computed (Issues #545 & #548, PR #551).
- Fix how pos_label_str is documented, read in, and used for classification tasks (Issues #550 & #570, PRs #566 & #571).
- Fix several bugs in the generate_predictions utility script and streamline its implementation to not rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues #484 & #562, PR #566).
- Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue #564, PR #567).
- Using an externally specified folds_file for grid search now works for evaluate and predict tasks, not just train (Issue #536, PR #538).
- Fix incorrect application of sampling before feature scaling in Learner.predict() (Issue #472, PR #474).
- Disable feature sampling for MultinomialNB learner since it cannot handle negative values (Issue #473, PR #474).
- Add missing logger attribute to Learner.FilteredLeaveOneGroupOut (Issue #541, PR #543).
- Fix FeatureSet.has_labels to recognize a list of None objects, which is what happens when you read in an unlabeled data set and pass label_col=None (Issue #426, PR #453).
- Fix bug in ARFFWriter that adds/removes label_col from the field names even if it's None to begin with (Issue #452, PR #453).
- Do not produce unnecessary warnings for learning curves (Issue #410, PR #458).
- Show a warning when applying feature hashing to multiple feature files (Issue #461, PR #479).
- Fix loading issue for saved MultinomialNB models (Issue #573, PR #574).
- Reduce memory usage for learning curve experiments by explicitly closing matplotlib figure instances after they are saved.
- Improve SKLL’s cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the newline parameter when writing files.
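The last item above reflects a general Python practice rather than anything SKLL-specific: pass an explicit encoding instead of relying on the platform default, and pass newline="" so the csv module controls line endings (otherwise Windows can end up with blank lines between rows). The sketch below shows the pattern with a temporary file; the helper names and the file name are made up for this example.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows as CSV with explicit UTF-8 encoding and csv-managed newlines."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_rows(path):
    """Read the CSV back using the same explicit encoding."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


original = [["id", "text"], ["1", "naïve café"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.csv")
    write_rows(path, original)
    round_tripped = read_rows(path)
```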
📖 Documentation Updates 📖
- Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the Output section (Issue #459, PR #568).
- Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue #448, PRs #547 & #552).
- Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue #511, PR #519).
- Update SKLL contribution guidelines and link to them from official documentation (Issues #498 & #514, PR #503 & #519).
- Update documentation to indicate that pandas and seaborn are now direct dependencies and not optional (Issue #553, PR #563).
- Update LogisticRegression learner documentation to talk explicitly about penalties and solvers (Issue #490, PR #500).
- Properly document the internal conversion of string labels to ints/floats and possible edge cases (Issue #436, PR #476).
- Add feature scaling to Boston regression example (Issue #469, PR #478).
- Several other additions/updates to documentation (Issue #459, PR #568).
✔️ Tests ✔️
- Make tests into a package so that we can do something like from skll.tests.utils import X etc. (Issue #530, PR #531).
- Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues #529 & #544, PR #546).
- Tweak tests to make test suite runnable on Windows (and pass!).
- Add Azure Pipelines integration for automated test builds on Windows.
- Added several new comprehensive tests for all new features and bugfixes. Also, removed older, unnecessary tests. See various PRs above for details.
- Current code coverage for SKLL tests is at 95%, the highest it has ever been!
🔍 Other changes 🔍
- Replace prettytable with the more actively maintained tabulate (Issue #356, PR #467).
- Make sure entire codebase complies with PEP8 (Issue #460, PR #568).
- Update the year to 2019 everywhere (Issue #447, PRs #456 & #568).
- Update TravisCI configuration to use conda_requirements.txt for building environment (PR #515).
👩🔬 Contributors 👨🔬
(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Supreeth Baliga (@SupreethBaliga), Jeremy Biggs (@jbiggsets), Aoife Cahill (@aoifecahill), Ananya Ganesh (@ananyaganesh), R. Gokul (@rgokul), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Robert Pugh (@Lguyogiro), Maxwell Schwartz (@maxwell-schwartz), Eugene Tsuprun (@etsuprun), Avijit Vajpayee (@AVajpayeeJr), Mengxuan Zhao (@chaomenghsuan)