This repository contains the Python code and data to reproduce the results presented in the paper: A. Occhipinti*, L. Rogers*, C. Angione, "A pipeline and comparative study of 12 machine learning models for text classification", Expert Systems with Applications, 201 (2022): 117193.
The following steps are required to run the code:
- Python 3.6.x is required; the code explicitly checks the version before it continues.
- A Jupyter Notebook server is required.
- The Enron spam corpus is the dataset used for this paper; the compressed tar archives containing the spam emails are included.
- Antivirus (AV) applications may flag some emails as malicious, a virus, or a scam. This is expected; restore any quarantined files where necessary.
- Ensure all pip dependencies listed in requirements.txt are installed.
- Run through the steps laid out in the notebook.
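
To confirm the interpreter before opening the notebook, a version check along the following lines can be run. This is a minimal sketch, not the check the notebook itself performs: the names `REQUIRED` and `is_supported` are illustrative assumptions, and the only detail taken from this README is the 3.6.x requirement.

```python
import sys

# Assumption: major.minor taken from the "Python 3.6.x" requirement above.
REQUIRED = (3, 6)


def is_supported(version_info=None, required=REQUIRED):
    """Return True when the major.minor version matches `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


# Report the result without stopping the interpreter.
status = "supported" if is_supported() else "not supported"
print("Python %s is %s (expected %d.%d.x)"
      % (sys.version.split()[0], status, REQUIRED[0], REQUIRED[1]))
```

The check compares only the major and minor numbers, so any 3.6 patch release passes while 3.5 or 3.7 does not.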