In this case study, we show how a machine learning use case implemented as a Jupyter notebook (taken from Kaggle, originally implemented by Saurav Palekar, and licensed under the Apache 2 license) can be successively refactored in order to
- improve the software design in general, achieving a high degree of clarity and maintainability,
- gain flexibility for experimentation,
- appropriately track results,
- arrive at a solution that can straightforwardly be deployed to production.
The use case considers a dataset from Kaggle containing meta-data on approximately one million songs (see download instructions below). The goal is to use this data to learn a model that predicts a song's popularity from other song attributes such as its tempo, release year, key and musical mode.
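For orientation, here is a minimal sketch of the modelling task once the data has been downloaded. The file path and the column names are illustrative assumptions and need not match the actual CSV.

```python
import pandas as pd

# Path and column names are assumptions for illustration;
# see the root README for the actual download instructions.
df = pd.read_csv("data/songs.csv")

# The learning task: predict a song's popularity from its other attributes.
X = df.drop(columns=["popularity"])
y = df["popularity"]
```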
Make sure you have created the Python virtual environment, set up a project in your IDE and downloaded the data as described in the root README file.
This package is organised as follows:
- There is one folder per step in the refactoring process with a dedicated README file explaining the key aspects of the respective step.
- There is an independent Python implementation of the use case in each folder, which you should inspect alongside the README file.
The intended way of exploring this package is to clone the repository and open it in your IDE of choice, such that you can browse it with familiar tools and navigate the code efficiently.
To more clearly see the concrete changes from one step to the next, you can use a diff tool. To support this, you may run the Python script `generate_repository.py`, which creates a git repository in the folder `refactoring-repo` in which the state of each step is referenced by a separate tag. Within that folder, you can then run, for example,

```shell
git difftool step04-model-specific-pipelines step05-sensai
```
These are the steps of the journey:
- This is the starting point: a Jupyter notebook which is largely unstructured.
- This step extracts the code that is strictly concerned with the training and evaluation of models.
- This step introduces an explicit representation for the dataset, making transformations explicit as well as optional (see the first sketch below).
- This step improves the code structure by adding function-specific Python modules.
- This step refactors the pipeline to move all transforming operations into the models, enabling different models to use entirely different pipelines (sketched below).
- This step introduces the high-level library sensAI, which will enable more flexible, declarative model specifications down the line.
- This step separates representations of features and their properties from the models that use them, allowing model input pipelines to be flexibly composed (sketched below).
- This step adds an engineered feature to the mix (covered in the same sketch below).
- This step applies sensAI's high-level abstraction for model evaluation, enabling logging.
- This step adds tracking functionality via sensAI's mlflow integration (and additionally by saving results directly to the file system); a plain-mlflow sketch is given below.
- This step considers the perhaps more natural formulation of the prediction problem as a regression problem.
- This step adds hyperparameter optimisation for the XGBoost regression model (sketched below).
- This step adds the option to use cross-validation (sketched below).
- This step adds a web service for inference, which is packaged in a Docker container (sketched below).
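The following sketches illustrate several of the steps above in simplified form; all class, file and column names are assumptions made for illustration and need not match the code in the step folders. First, the explicit dataset representation: the dataset becomes an object, and transformations become named, optional parameters instead of scattered notebook cells.

```python
from typing import Optional

import pandas as pd


class Dataset:
    """Explicit dataset representation: transformations become named,
    optional parameters rather than hard-coded notebook cells."""

    def __init__(self, num_samples: Optional[int] = None,
            drop_zero_popularity: bool = False):
        self.num_samples = num_samples  # optionally subsample for fast experiments
        self.drop_zero_popularity = drop_zero_popularity  # an optional transformation

    def load_data_frame(self) -> pd.DataFrame:
        df = pd.read_csv("data/songs.csv")  # path is an assumption
        if self.drop_zero_popularity:
            df = df[df["popularity"] > 0]
        if self.num_samples is not None:
            df = df.sample(self.num_samples, random_state=42)
        return df
```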
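For the model-specific-pipelines step, the idea expressed in scikit-learn terms: each model owns its input transformations, so a linear model can scale its inputs while a tree ensemble receives them unscaled. The model choices below are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# A linear model benefits from scaled inputs ...
logreg_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# ... while a tree-based model can consume the raw features directly.
forest_pipeline = Pipeline([
    ("model", RandomForestClassifier(n_estimators=100)),
])
```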
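The feature-representation and feature-engineering steps separate feature definitions from the models that use them, so that a model's input pipeline can be composed declaratively. The plain-Python registry below only illustrates the concept (sensAI provides dedicated feature generator abstractions for this); column names are assumptions, and the "decade" feature is a hypothetical example, not necessarily the feature actually added.

```python
from typing import Callable, Dict, List

import pandas as pd

# A registry mapping feature names to functions that compute the features;
# purely conceptual -- not sensAI's actual API.
FEATURE_GENERATORS: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {
    "tempo": lambda df: df[["tempo"]],
    "key_and_mode": lambda df: df[["key", "mode"]],
}

# An engineered feature is just one more registry entry; here, a hypothetical
# 'decade' feature derived from the release year.
FEATURE_GENERATORS["decade"] = lambda df: (df[["year"]] // 10 * 10).rename(
    columns={"year": "decade"})


def compute_features(df: pd.DataFrame, feature_names: List[str]) -> pd.DataFrame:
    """Compose a model's input from a declarative list of feature names."""
    return pd.concat([FEATURE_GENERATORS[n](df) for n in feature_names], axis=1)
```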
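The tracking step uses sensAI's mlflow integration; stripped of that integration, the underlying mechanics look roughly as follows. The experiment, parameter and metric names are made up, and the metric value is a placeholder.

```python
import mlflow

mlflow.set_experiment("song-popularity")  # experiment name is an assumption
with mlflow.start_run(run_name="xgboost-regressor"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("R2", 0.42)  # placeholder value, not an actual result
```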
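For the hyperparameter-optimisation step, a hedged sketch using scikit-learn's grid search over an XGBoost regressor; the actual step may use a different optimiser, and the search space shown is arbitrary.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}

search = GridSearchCV(XGBRegressor(), param_grid, scoring="r2", cv=3)
# search.fit(X_train, y_train)  # X_train/y_train as produced by the data pipeline
# print(search.best_params_)
```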
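The cross-validation option, in its simplest scikit-learn form:

```python
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Evaluate with 5-fold cross-validation instead of a single train/test split;
# X and y as produced by the data pipeline.
scores = cross_val_score(XGBRegressor(), X, y, cv=5, scoring="r2")
# print(scores.mean(), scores.std())
```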
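Finally, the inference web service. This is a minimal sketch using FastAPI and pydantic v2; the framework choice, model path and attribute names are assumptions, and the pickled model is assumed to expose a scikit-learn-style `predict`.

```python
import pickle

import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once at startup (path is an assumption)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class Song(BaseModel):
    # Illustrative subset of the song attributes
    tempo: float
    year: int
    key: int
    mode: int


@app.post("/predict")
def predict(song: Song) -> dict:
    input_df = pd.DataFrame([song.model_dump()])  # one-row input frame
    return {"popularity": float(model.predict(input_df)[0])}
```

Such an app can be served with `uvicorn` and containerised with a standard Dockerfile, as described in the corresponding step's README.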