Rossmann Store Sales Prediction

01. Business Problem

Rossmann operates over 3,000 drugstore chains in 7 European countries. Store sales are influenced by various factors, including promotions, competition, school and state holidays, seasonality, and location. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. With thousands of individual managers making predictions based on their unique circumstances, the accuracy of results can vary significantly.

As a Data Scientist, my task was to help predict these daily sales with a six-week lead time through Data Analysis and Machine Learning models and report the results to the CFO via a Telegram Bot.

02. Assumptions

The dataset contains sales made between 01/01/2013 and 07/31/2015.
The original variables/attributes (and their meanings) in the dataset are as follows:

Attributes	Meaning
store	unique id for each store
day_of_week	day of the week (1 to 7)
date	date
sales	financial volume of sales on the day
customers	number of customers on the day
open	indicator if the store was open (0 = closed, 1 = open)
promo	indicates if the store had a promotion on the day
state_holiday	indicates if it was a state/national holiday (a = public holiday, b = Easter, c = Christmas, 0 = none)
school_holiday	indicates if the store was affected by the closure of public schools on the day
store_type	indicates the store type (4 types: a, b, c, d)
assortment	describes the level of product assortment (a = basic, b = extra, c = extended)
competition_distance	distance in meters to the nearest competitor store
competition_open_since_month	month when the nearest competitor store opened
competition_open_since_year	year when the nearest competitor store opened
promo2	ongoing and consecutive promotion for some stores (0 = store did not participate, 1 = store participated)
promo2_since_week	indicates the week of the year when the store started participating in promo2
promo2_since_year	indicates the year when the store started participating in promo2 (complementary to promo2_since_week)
promo_interval	describes the consecutive intervals when promo2 started, indicating the months when the promotion started again (Ex.: "Feb,May,Aug,Nov" means each interval started in February, May, August, and November in a given year for a given store)

03. Solution Strategy

The strategy used was the CRISP method, divided into 10 parts:

Business question presented by the CFO.
Understanding the Rossmann Business.
Data Collection.
Data Cleaning.
Data Exploration.
Data Modeling.
Using Machine Learning Models.
Evaluation of Machine Learning Algorithm.
Model Deployment in Production.
Telegram Bot.

04. Key Business Insights

The following business hypotheses were raised based on how Rossmann operates:

Stores with a larger assortment should sell more.
Stores with closer competitors should sell less.
Stores with longer-standing competitors should sell more.
Stores with longer-running promotions should sell more.
Stores with more consecutive promotions should sell more.
Stores open during Christmas holidays should sell more.
Stores should sell more over the years.
Stores should sell more in the second half of the year.
Stores should sell more after the 10th day of each month.
Stores should sell less on weekends.
Stores should sell less during school holidays.

Among the 11 hypotheses, the most important ones considered by me were:

Stores with longer-standing competitors should sell more.

From the graph, we can see that the closer the competitor store opens to the current date (2015), the more Rossmann stores sell. This can also be seen from the positive correlation.

Stores should sell more in the second semester.

From the graph, we can see that sales drop significantly after month 6, with a strong negative correlation.

Stores should sell more after the 10th day of each month.

From the graph, we observe that store sales fluctuate throughout the days of the month. There is no strong correlation between the day of the month and sales.

Sales should be lower during school holidays.

Indeed, sales are lower during school holidays, but there is a detail in months 7 and 8 where sales are equal and higher. This is likely due to summer vacations.

05. Machine Learning

The following Machine Learning models were used to analyze sales prediction:

Average Model
Linear Regression
Linear Regression Regularized
Random Forest Regressor
XGBoost Regressor

After analyzing the errors (MAE, MAPE, and RMSE) below, I chose to proceed with the XGBoost Regressor. Although the Random Forest presented slightly better error values, it has a significantly larger size than the XGBoost Regressor.

Model Name	MAE	MAPE	RMSE
Random Forest Regressor	680.854054	0.100115	1012.687130
XGBoost Regressor	868.958205	0.130309	1238.550843
Average Model	1354.800353	0.206400	1835.135542
Linear Regression	1867.089774	0.292694	2671.049215
Linear Regression - Lasso	1891.704881	0.289106	2744.451741

After applying Cross-Validation, we can observe from the table below that the errors are close to the errors applied to the training dataset above, confirming that the model has good generalization capability.

Model Name	MAE CV	MAPE CV	RMSE CV
Random Forest Regressor	836.83 +/- 217.3	0.12 +/- 0.02	1255.04 +/- 316.89
XGBoost Regressor	1060.51 +/- 178.41	0.15 +/- 0.02	1512.59 +/- 241.37
Linear Regressoion	2081.73 +/- 295.63	0.3 +/- 0.02	2952.52 +/- 468.37
Linear Regression - Lasso	2116.38 +/- 341.5	0.29 +/- 0.01	3057.75 +/- 504.26

Finally, the final model, after adjusting the hyperparameters, was as follows:

Model Name	MAE	MAPE	RMSE
XGBoost Regressor	695.985642	0.102112	1007.379751

06. Business Model Translation

From the predictions made by the model, we can translate the error values found into estimated business values. In the table below are the model's predictions considering the worst and best financial scenarios.

Scenarios	Values
prediction	R$ 284,718,336.00
worst_scenario	R$ 283,938,442.77
best_scenario	R$ 285,498,228.15

07. Telegram Bot

After deploying the model to production, I created a Telegram bot that retrieves real-time sales predictions for the desired store:

08. Final Product and Conclusion

The final product was the Telegram bot that can be accessed by the CFO from any device connected to the internet and using Telegram, facilitating the budget definition for the renovation of each store. Additionally, the hypotheses generated valuable insights that can be used to increase sales.

09. Next Steps

Continue with additional CRISP cycles, focusing on:

Testing other Machine Learning models.
Filling missing data in other ways.
Testing other approaches to deal with outliers.
Exploring alternative strategies to define model hyperparameters.
Adding more query options to the Telegram Bot.

References

Dataset: https://www.kaggle.com/competitions/rossmann-store-sales

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
api		api
data		data
model		model
notebooks		notebooks
parameter		parameter
rossmann-telegram-api		rossmann-telegram-api
webapp		webapp
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
rossmann_sales_prediction.ipynb		rossmann_sales_prediction.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rossmann Store Sales Prediction

01. Business Problem

02. Assumptions

03. Solution Strategy

04. Key Business Insights

05. Machine Learning

06. Business Model Translation

07. Telegram Bot

08. Final Product and Conclusion

09. Next Steps

References

About

Releases

Packages

Languages

EDJR94/Data-Science-Em-Producao

Folders and files

Latest commit

History

Repository files navigation

Rossmann Store Sales Prediction

01. Business Problem

02. Assumptions

03. Solution Strategy

04. Key Business Insights

05. Machine Learning

06. Business Model Translation

07. Telegram Bot

08. Final Product and Conclusion

09. Next Steps

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages