Rossmann operates over 3,000 drugstore chains in 7 European countries. Store sales are influenced by various factors, including promotions, competition, school and state holidays, seasonality, and location. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. With thousands of individual managers making predictions based on their unique circumstances, the accuracy of results can vary significantly.
As a Data Scientist, my task was to help predict these daily sales with a six-week lead time through Data Analysis and Machine Learning models and report the results to the CFO via a Telegram Bot.
- The dataset contains sales made between 01/01/2013 and 07/31/2015.
- The original variables/attributes (and their meanings) in the dataset are as follows:
Attributes | Meaning |
---|---|
store | unique id for each store |
day_of_week | day of the week (1 to 7) |
date | date |
sales | financial volume of sales on the day |
customers | number of customers on the day |
open | indicator if the store was open (0 = closed, 1 = open) |
promo | indicates if the store had a promotion on the day |
state_holiday | indicates if it was a state/national holiday (a = public holiday, b = Easter, c = Christmas, 0 = none) |
school_holiday | indicates if the store was affected by the closure of public schools on the day |
store_type | indicates the store type (4 types: a, b, c, d) |
assortment | describes the level of product assortment (a = basic, b = extra, c = extended) |
competition_distance | distance in meters to the nearest competitor store |
competition_open_since_month | month when the nearest competitor store opened |
competition_open_since_year | year when the nearest competitor store opened |
promo2 | ongoing and consecutive promotion for some stores (0 = store did not participate, 1 = store participated) |
promo2_since_week | indicates the week of the year when the store started participating in promo2 |
promo2_since_year | indicates the year when the store started participating in promo2 (complementary to promo2_since_week) |
promo_interval | describes the consecutive intervals when promo2 started, indicating the months when the promotion started again (Ex.: "Feb,May,Aug,Nov" means each interval started in February, May, August, and November in a given year for a given store) |
The strategy used was the CRISP method, divided into 10 parts:
- Business question presented by the CFO.
- Understanding the Rossmann Business.
- Data Collection.
- Data Cleaning.
- Data Exploration.
- Data Modeling.
- Using Machine Learning Models.
- Evaluation of Machine Learning Algorithm.
- Model Deployment in Production.
- Telegram Bot.
The following business hypotheses were raised based on how Rossmann operates:
- Stores with a larger assortment should sell more.
- Stores with closer competitors should sell less.
- Stores with longer-standing competitors should sell more.
- Stores with longer-running promotions should sell more.
- Stores with more consecutive promotions should sell more.
- Stores open during Christmas holidays should sell more.
- Stores should sell more over the years.
- Stores should sell more in the second half of the year.
- Stores should sell more after the 10th day of each month.
- Stores should sell less on weekends.
- Stores should sell less during school holidays.
Among the 11 hypotheses, the most important ones considered by me were:
- Stores with longer-standing competitors should sell more.
From the graph, we can see that the closer the competitor store opens to the current date (2015), the more Rossmann stores sell. This can also be seen from the positive correlation.
- Stores should sell more in the second semester.
From the graph, we can see that sales drop significantly after month 6, with a strong negative correlation.
- Stores should sell more after the 10th day of each month.
From the graph, we observe that store sales fluctuate throughout the days of the month. There is no strong correlation between the day of the month and sales.
- Sales should be lower during school holidays.
Indeed, sales are lower during school holidays, but there is a detail in months 7 and 8 where sales are equal and higher. This is likely due to summer vacations.
The following Machine Learning models were used to analyze sales prediction:
- Average Model
- Linear Regression
- Linear Regression Regularized
- Random Forest Regressor
- XGBoost Regressor
After analyzing the errors (MAE, MAPE, and RMSE) below, I chose to proceed with the XGBoost Regressor. Although the Random Forest presented slightly better error values, it has a significantly larger size than the XGBoost Regressor.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 680.854054 | 0.100115 | 1012.687130 |
XGBoost Regressor | 868.958205 | 0.130309 | 1238.550843 |
Average Model | 1354.800353 | 0.206400 | 1835.135542 |
Linear Regression | 1867.089774 | 0.292694 | 2671.049215 |
Linear Regression - Lasso | 1891.704881 | 0.289106 | 2744.451741 |
After applying Cross-Validation, we can observe from the table below that the errors are close to the errors applied to the training dataset above, confirming that the model has good generalization capability.
Model Name | MAE CV | MAPE CV | RMSE CV |
---|---|---|---|
Random Forest Regressor | 836.83 +/- 217.3 | 0.12 +/- 0.02 | 1255.04 +/- 316.89 |
XGBoost Regressor | 1060.51 +/- 178.41 | 0.15 +/- 0.02 | 1512.59 +/- 241.37 |
Linear Regressoion | 2081.73 +/- 295.63 | 0.3 +/- 0.02 | 2952.52 +/- 468.37 |
Linear Regression - Lasso | 2116.38 +/- 341.5 | 0.29 +/- 0.01 | 3057.75 +/- 504.26 |
Finally, the final model, after adjusting the hyperparameters, was as follows:
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
XGBoost Regressor | 695.985642 | 0.102112 | 1007.379751 |
From the predictions made by the model, we can translate the error values found into estimated business values. In the table below are the model's predictions considering the worst and best financial scenarios.
Scenarios | Values |
---|---|
prediction | R$ 284,718,336.00 |
worst_scenario | R$ 283,938,442.77 |
best_scenario | R$ 285,498,228.15 |
After deploying the model to production, I created a Telegram bot that retrieves real-time sales predictions for the desired store:
The final product was the Telegram bot that can be accessed by the CFO from any device connected to the internet and using Telegram, facilitating the budget definition for the renovation of each store. Additionally, the hypotheses generated valuable insights that can be used to increase sales.
Continue with additional CRISP cycles, focusing on:
- Testing other Machine Learning models.
- Filling missing data in other ways.
- Testing other approaches to deal with outliers.
- Exploring alternative strategies to define model hyperparameters.
- Adding more query options to the Telegram Bot.
Dataset: https://www.kaggle.com/competitions/rossmann-store-sales