This repo was created as part of the project work for the Machine Learning Laboratory (MSDS 699) course in the University of San Francisco's Master's in Data Science program.
The data comes from a Kaggle competition: https://www.kaggle.com/c/tmdb-box-office-prediction/data
Contributors:
The goal of our project was to predict the box office revenue of movies. The work involved:
- Data processing, which included missing value imputation and encoding categorical features.
- Feature engineering: deriving meaningful features such as the age of the movie.
- Building a pipeline to fit various machine learning models.
- Evaluating the models using relevant metrics and defining a North Star metric.
- Choosing the best model and visually inspecting our results.
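The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy data, the column names (`budget`, `runtime`, `original_language`, `release_date`), the reference date used for the movie's age, and the imputation strategies are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "budget": [1.0e7, 5.0e6, np.nan, 2.0e8, 3.0e7, 7.5e7],
    "runtime": [95.0, np.nan, 120.0, 140.0, 101.0, 88.0],
    "original_language": ["en", "fr", "en", "en", "hi", "en"],
    "release_date": pd.to_datetime([
        "2001-05-04", "1995-11-22", "2010-07-16",
        "2015-12-18", "2008-02-01", "1988-06-10",
    ]),
    "revenue": [3.0e7, 8.0e6, 1.2e8, 9.0e8, 4.5e7, 1.6e8],
})

# Feature engineering: age of the movie in years, relative to an assumed reference date.
reference_date = pd.Timestamp("2019-01-01")
df["movie_age"] = (reference_date - df["release_date"]).dt.days / 365.25

numeric_features = ["budget", "runtime", "movie_age"]
categorical_features = ["original_language"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill missing values with the mode, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", categorical_steps, categorical_features),
])

# Preprocessing and the regressor live in one pipeline, so any model can be swapped in.
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

X = df[numeric_features + categorical_features]
y = df["revenue"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions.round(0))
```

Keeping the imputers and encoder inside the pipeline means they are fit only on the training data whenever the pipeline is cross-validated, which avoids leaking information from held-out rows.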
We chose a linear model (Ridge Regression) as our baseline.
We fit the following models to our data:
- Ridge Regression
- KNeighborsRegressor
- BayesianRidge
- RandomForestRegressor
- XGBoost
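One way to compare the models listed above is to score each with the same cross-validation splits. The sketch below uses synthetic data from `make_regression` and default hyperparameters, both of which are assumptions; it does not reproduce the notebook's setup or results. XGBoost is left out so the example depends only on scikit-learn; if the `xgboost` package is installed, an `xgboost.XGBRegressor` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the processed movie features (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

models = {
    "Ridge (baseline)": Ridge(alpha=1.0),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "BayesianRidge": BayesianRidge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # scikit-learn returns the negated median absolute error, so flip the sign.
    cv_scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_median_absolute_error"
    )
    scores[name] = -cv_scores.mean()

# Lower median absolute error is better.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MedAE = {score:.2f}")
```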
Of these models, RandomForestRegressor performed best:
- MedAE (in millions of dollars): 12.98 (North Star metric)
- R2 score: 0.85
- RMSLE: 1.68
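For reference, the three metrics above can be computed with scikit-learn as in the sketch below. The `y_true` and `y_pred` arrays are made-up values for illustration, not the project's predictions.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_log_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical revenues in dollars; not taken from the project's data.
y_true = np.array([10e6, 20e6, 30e6, 40e6])
y_pred = np.array([12e6, 18e6, 33e6, 35e6])

# MedAE: median of |y_true - y_pred|, reported in millions of dollars.
medae_millions = median_absolute_error(y_true, y_pred) / 1e6

# R2: 1 - (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)

# RMSLE: square root of the mean squared difference of log(1 + y).
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"MedAE (millions of $): {medae_millions:.2f}")
print(f"R2: {r2:.3f}")
print(f"RMSLE: {rmsle:.3f}")
```

MedAE is robust to a handful of blockbuster outliers, while RMSLE penalizes relative rather than absolute errors, so the two can rank models differently.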
- 'Budget' of the movie is the most important predictor according to permutation feature importance, which makes sense from a business perspective.
- It is difficult to predict a movie's box office performance accurately because the data lacks factors such as:
  - Overall economy at the time of the movie release
  - Quality of the movie's plot and other exogenous factors
  - Presence of streaming services such as Amazon Prime and Netflix
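Permutation feature importance, mentioned above for 'Budget', measures how much a model's score drops when one feature's values are shuffled. The sketch below shows the technique with scikit-learn's `permutation_importance`. The data is synthetic and constructed so that `budget` dominates, so the ranking it prints illustrates the method only and is not evidence for the project's finding; the feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features; revenue depends mostly on budget by construction.
rng = np.random.default_rng(0)
n_samples = 300
X = pd.DataFrame({
    "budget": rng.uniform(1, 200, n_samples),
    "runtime": rng.uniform(80, 180, n_samples),
    "movie_age": rng.uniform(0, 50, n_samples),
})
y = 3.0 * X["budget"] + 0.1 * X["runtime"] + rng.normal(0, 5, n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each feature on held-out data and record the mean drop in R2.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

ranking = pd.Series(result.importances_mean, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Computing the importances on held-out data rather than the training set avoids rewarding features the model has merely memorized.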
To run our notebook and reproduce the results, follow these steps:
- Clone the repository with the following command:
git clone https://github.com/ShreejayaB/TMDB-Box-Office-Predictions
- Run the following command to create the virtual environment named 'tmdb_box_office_pred_ml':
conda env create -f tmdb_box_office_pred_venv.yml -n tmdb_box_office_pred_ml
- Activate this virtual environment with the following command:
conda activate tmdb_box_office_pred_ml
- Start the notebook server from the root directory with the jupyter notebook command.