In this case study, we show how a machine learning use case implemented as a Jupyter notebook (taken from Kaggle, originally implemented by Saurav Palekar, and licensed under the Apache 2 license) can be successively refactored in order to
- improve the software design in general, achieving a high degree of clarity and maintainability,
- gain flexibility for experimentation,
- appropriately track results,
- arrive at a solution that can straightforwardly be deployed to production.
The use case considers a dataset from Kaggle containing meta-data on approximately one million songs (see download instructions below). The goal is to use this data to learn a model that predicts a song's popularity from other song attributes such as its tempo, release year, key and musical mode.
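For orientation, here is a minimal sketch of the modelling task once the data has been downloaded. The file path and the column names are illustrative assumptions and need not match the actual CSV.

```python
import pandas as pd

# Path and column names are assumptions for illustration;
# see the root README for the actual download instructions.
df = pd.read_csv("data/songs.csv")

# The learning task: predict a song's popularity from its other attributes.
X = df.drop(columns=["popularity"])
y = df["popularity"]
```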
Make sure you have created the Python virtual environment, set up a project in your IDE and downloaded the data as described in the root README file.
This package is organised as follows:
- There is one folder per step in the refactoring process with a dedicated README file explaining the key aspects of the respective step.
- There is an independent Python implementation of the use case in each folder, which you should inspect alongside the README file.
The intended way of exploring this package is to clone the repository and open it in your IDE of choice, such that you can browse it with familiar tools and navigate the code efficiently.
To more clearly see the concrete changes from one step to the next, you can use a diff tool. To support this, you may run the Python script `generate_repository.py`, which creates a git repository in the folder `refactoring-repo` in which the state of each step is referenced by a separate tag. Within that folder, you can then run, for example,

```shell
git difftool step04-model-specific-pipelines step05-sensai
```
These are the steps of the journey:
- This is the starting point: a Jupyter notebook which is largely unstructured.
- This step extracts the code that is strictly concerned with the training and evaluation of models.
- This step introduces an explicit representation for the dataset, making transformations explicit as well as optional (see the first sketch below).
- This step improves the code structure by adding function-specific Python modules.
- This step refactors the pipeline to move all transforming operations into the models, enabling different models to use entirely different pipelines (sketched below).
- This step introduces the high-level library sensAI, which will enable more flexible, declarative model specifications down the line.
- This step separates representations of features and their properties from the models that use them, allowing model input pipelines to be flexibly composed (sketched below).
- This step adds an engineered feature to the mix (covered in the same sketch below).
- This step applies sensAI's high-level abstraction for model evaluation, enabling logging.
- This step adds tracking functionality via sensAI's mlflow integration (and additionally by saving results directly to the file system); a plain-mlflow sketch is given below.
- This step considers the perhaps more natural formulation of the prediction problem as a regression problem.
- This step adds hyperparameter optimisation for the XGBoost regression model (sketched below).
- This step adds the option to use cross-validation (sketched below).
- This step adds a web service for inference, which is packaged in a Docker container (sketched below).
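The following sketches illustrate several of the steps above in simplified form; all class, file and column names are assumptions made for illustration and need not match the code in the step folders. First, the explicit dataset representation: the dataset becomes an object, and transformations become named, optional parameters instead of scattered notebook cells.

```python
from typing import Optional

import pandas as pd


class Dataset:
    """Explicit dataset representation: transformations become named,
    optional parameters rather than hard-coded notebook cells."""

    def __init__(self, num_samples: Optional[int] = None,
            drop_zero_popularity: bool = False):
        self.num_samples = num_samples  # optionally subsample for fast experiments
        self.drop_zero_popularity = drop_zero_popularity  # an optional transformation

    def load_data_frame(self) -> pd.DataFrame:
        df = pd.read_csv("data/songs.csv")  # path is an assumption
        if self.drop_zero_popularity:
            df = df[df["popularity"] > 0]
        if self.num_samples is not None:
            df = df.sample(self.num_samples, random_state=42)
        return df
```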
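For the model-specific-pipelines step, the idea expressed in scikit-learn terms: each model owns its input transformations, so a linear model can scale its inputs while a tree ensemble receives them unscaled. The model choices below are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# A linear model benefits from scaled inputs ...
logreg_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# ... while a tree-based model can consume the raw features directly.
forest_pipeline = Pipeline([
    ("model", RandomForestClassifier(n_estimators=100)),
])
```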
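The feature-representation and feature-engineering steps separate feature definitions from the models that use them, so that a model's input pipeline can be composed declaratively. The plain-Python registry below only illustrates the concept (sensAI provides dedicated feature generator abstractions for this); column names are assumptions, and the "decade" feature is a hypothetical example, not necessarily the feature actually added.

```python
from typing import Callable, Dict, List

import pandas as pd

# A registry mapping feature names to functions that compute the features;
# purely conceptual -- not sensAI's actual API.
FEATURE_GENERATORS: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {
    "tempo": lambda df: df[["tempo"]],
    "key_and_mode": lambda df: df[["key", "mode"]],
}

# An engineered feature is just one more registry entry; here, a hypothetical
# 'decade' feature derived from the release year.
FEATURE_GENERATORS["decade"] = lambda df: (df[["year"]] // 10 * 10).rename(
    columns={"year": "decade"})


def compute_features(df: pd.DataFrame, feature_names: List[str]) -> pd.DataFrame:
    """Compose a model's input from a declarative list of feature names."""
    return pd.concat([FEATURE_GENERATORS[n](df) for n in feature_names], axis=1)
```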
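The tracking step uses sensAI's mlflow integration; stripped of that integration, the underlying mechanics look roughly as follows. The experiment, parameter and metric names are made up, and the metric value is a placeholder.

```python
import mlflow

mlflow.set_experiment("song-popularity")  # experiment name is an assumption
with mlflow.start_run(run_name="xgboost-regressor"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("R2", 0.42)  # placeholder value, not an actual result
```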
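For the hyperparameter-optimisation step, a hedged sketch using scikit-learn's grid search over an XGBoost regressor; the actual step may use a different optimiser, and the search space shown is arbitrary.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}

search = GridSearchCV(XGBRegressor(), param_grid, scoring="r2", cv=3)
# search.fit(X_train, y_train)  # X_train/y_train as produced by the data pipeline
# print(search.best_params_)
```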
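The cross-validation option, in its simplest scikit-learn form:

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Evaluate with 5-fold cross-validation instead of a single train/test split;
# X and y as produced by the data pipeline.
scores = cross_val_score(XGBRegressor(), X, y, cv=5, scoring="r2")
# print(scores.mean(), scores.std())
```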
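Finally, the inference web service. This is a minimal sketch using FastAPI and pydantic v2; the framework choice, model path and attribute names are assumptions, and the pickled model is assumed to expose a scikit-learn-style `predict`.

```python
import pickle

import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once at startup (path is an assumption)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class Song(BaseModel):
    # Illustrative subset of the song attributes
    tempo: float
    year: int
    key: int
    mode: int


@app.post("/predict")
def predict(song: Song) -> dict:
    input_df = pd.DataFrame([song.model_dump()])  # one-row input frame
    return {"popularity": float(model.predict(input_df)[0])}
```

Such an app can be served with `uvicorn` and containerised with a standard Dockerfile, as described in the corresponding step's README.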