---
title: "Time series regression - Store sales forecasting, part 1"
author: "Ahmet Zamanis"
format:
  gfm:
    toc: true
    code-fold: true
    code-summary: "Show code"
editor: visual
jupyter: python3
execute:
  warning: false
---
## Introduction
This is a report on time series analysis & regression modeling, performed in Python, mainly with the [Darts](https://unit8co.github.io/darts/) library. The dataset is from the [Kaggle Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting) competition. The data consists of daily sales data for an Ecuadorian supermarket chain between 2013 and 2017. This is part 1 of the analysis, which will focus only on forecasting the daily national sales of the chain, aggregated across all stores and categories. In part 2, we will forecast the sales in each category - store combination as required by the competition, and apply hierarchical reconciliation methods.
The main information source used extensively for this analysis is the textbook [Forecasting: Principles and Practice](https://otexts.com/fpp3/), written by Rob J. Hyndman and George Athanasopoulos. The book is the most complete source on time series analysis & forecasting I could find. It uses R and the [tidyverts](https://tidyverts.org/) libraries for its examples, which have the best time series analysis workflow among the R libraries I've tried.
```{python Settings}
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set printing options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams["figure.autolayout"] = True
```
## Data preparation
```{python Load data}
# Load original datasets
df_train = pd.read_csv("./OriginalData/train.csv", encoding="utf-8")
df_stores = pd.read_csv("./OriginalData/stores.csv", encoding="utf-8")
df_oil = pd.read_csv("./OriginalData/oil.csv", encoding="utf-8")
df_holidays = pd.read_csv("./OriginalData/holidays_events.csv", encoding="utf-8")
df_trans = pd.read_csv("./OriginalData/transactions.csv", encoding="utf-8")
```
The data is split into several .csv files. **train.csv** and **test.csv** are the main datasets, consisting of daily sales data. The training data ranges from 01-01-2013 to 15-08-2017, and the testing data consists of the following 16 days until the end of August 2017. We won't do a competition submission in part 1, so we won't work on the testing data.
```{python df}
print(df_train.head(5))
```
- For each day, we have the sales in each store (out of a possible 54) and each product category (out of a possible 33). This amounts to 1782 time series that need to be forecasted for the competition.
- **onpromotion** is the number of items on sale in one day, in one category & store combination.
**stores.csv** contains more information about each store: The city, state, store type and store cluster.
```{python df_stores}
print(df_stores.head(5))
```
**holidays_events.csv** contains information about local (city-wide), regional (state-wide) and national holidays, as well as some special nationwide events that occurred in the time period. We will use these along with the stores' location data to create calendar features, to adjust for the effects of holidays and special events on sales.
```{python df_holidays}
print(df_holidays.head(5))
```
**oil.csv** consists of the daily oil prices in the time period. Ecuador has an oil-dependent economy, so this may be a useful predictor of the cyclicality in supermarket sales. We don't have the oil price for the first day of the data.
```{python df_oil}
print(df_oil.head(5))
```
**transactions.csv** consists of the daily number of transactions at each store. Note that only store 25 is recorded for the first day, so it's likely that not all stores are open every day.
```{python df_trans}
print(df_trans.head(5))
```
We will rename some columns from the datasets and merge the supplementary information into the main sales dataset. We'll aggregate daily transactions across all stores beforehand, as we are only interested in predicting daily national sales.
```{python MergeData}
# Rename columns
df_holidays = df_holidays.rename(columns = {"type":"holiday_type"})
df_oil = df_oil.rename(columns = {"dcoilwtico":"oil"})
df_stores = df_stores.rename(columns = {
"type":"store_type", "cluster":"store_cluster"})
# Aggregate daily transactions across all stores
df_trans_agg = df_trans.groupby("date").transactions.sum()
# Add columns from oil, stores and transactions datasets into main data
df_train = df_train.merge(df_trans_agg, on = ["date"], how = "left")
df_train = df_train.merge(df_oil, on = "date", how = "left")
df_train = df_train.merge(df_stores, on = "store_nbr", how = "left")
```
Incorporating the holidays information into the sales dataset will require more work. We'll create four binary columns that record whether a day is a local, regional or national holiday, or had a special event.
```{python SplitHolidays}
# Split holidays data into local, regional, national and events
events = df_holidays[df_holidays["holiday_type"] == "Event"].copy()
df_holidays = df_holidays.drop(labels=(events.index), axis=0)
local = df_holidays.loc[df_holidays["locale"] == "Local"].copy()
regional = df_holidays.loc[df_holidays["locale"] == "Regional"].copy()
national = df_holidays.loc[df_holidays["locale"] == "National"].copy()
```
There are cases of multiple holidays or events sharing the same date and locale. We'll inspect the duplicates and drop them so we don't have issues in feature engineering.
- Rows with **transferred = True** are dates that are normally holidays, but the holiday was transferred to another date. In other words, these are not holidays in effect.
- Rows with **holiday_type = Transfer** are dates that are not normally holidays, but had another holiday transferred to this date. In other words, these are holidays in effect.
- Rows with **holiday_type = Bridge** are dates that are not normally holidays, but were added to extend preceding / following holidays.
```{python Duplicates1}
# Inspect local holidays sharing same date & locale
print(local[local.duplicated(["date", "locale_name"], keep = False)])
```
```{python DropDuplicates1}
# Drop the transfer row
local = local.drop(265, axis = 0)
```
```{python Duplicates2}
# Inspect regional holidays sharing same date & locale. None exist
print(regional[regional.duplicated(["date", "locale_name"], keep = False)])
```
```{python Duplicates3}
# Inspect national holidays sharing same date & locale
print(national[national.duplicated(["date"], keep = False)])
```
```{python DropDuplicates3}
# Drop bridge days
national = national.drop([35, 39, 156], axis = 0)
```
```{python Duplicates4}
# Inspect events sharing same date
print(events[events.duplicated(["date"], keep = False)])
```
```{python DropDuplicates4}
# Drop the earthquake row as it is a one-time event
events = events.drop(244, axis = 0)
```
After getting rid of duplicates, we can create binary columns that signify whether a date was a local / regional / national holiday / event. We'll merge these back into the sales data.
```{python HolidayEventDummies}
# Add local_holiday binary column to local holidays data, to be merged into main
# data.
local["local_holiday"] = (
local.holiday_type.isin(["Transfer", "Additional", "Bridge"]) |
((local.holiday_type == "Holiday") & (local.transferred == False))
).astype(int)
# Add regional_holiday binary column to regional holidays data
regional["regional_holiday"] = (
regional.holiday_type.isin(["Transfer", "Additional", "Bridge"]) |
((regional.holiday_type == "Holiday") & (regional.transferred == False))
).astype(int)
# Add national_holiday binary column to national holidays data
national["national_holiday"] = (
national.holiday_type.isin(["Transfer", "Additional", "Bridge"]) |
((national.holiday_type == "Holiday") & (national.transferred == False))
).astype(int)
# Add event column to events
events["event"] = 1
# Merge local holidays binary column to main data, on date and city
local_merge = local.drop(
labels = [
"holiday_type", "locale", "description", "transferred"], axis = 1).rename(
columns = {"locale_name":"city"})
df_train = df_train.merge(local_merge, how="left", on=["date", "city"])
df_train["local_holiday"] = df_train["local_holiday"].fillna(0).astype(int)
# Merge regional holidays binary column to main data
regional_merge = regional.drop(
labels = [
"holiday_type", "locale", "description", "transferred"], axis = 1).rename(
columns = {"locale_name":"state"})
df_train = df_train.merge(regional_merge, how="left", on=["date", "state"])
df_train["regional_holiday"] = df_train["regional_holiday"].fillna(0).astype(int)
# Merge national holidays binary column to main data, on date
national_merge = national.drop(
labels = [
"holiday_type", "locale", "locale_name", "description",
"transferred"], axis = 1)
df_train = df_train.merge(national_merge, how="left", on="date")
df_train["national_holiday"] = df_train["national_holiday"].fillna(0).astype(int)
# Merge events binary column to main data
events_merge = events.drop(
labels = [
"holiday_type", "locale", "locale_name", "description",
"transferred"], axis = 1)
df_train = df_train.merge(events_merge, how="left", on="date")
df_train["event"] = df_train["event"].fillna(0).astype(int)
```
We'll set the **date** column to a DateTimeIndex, and view the sales data with the added columns. Remember that the transactions column is now the national transactions across all stores in one day.
```{python DatetimeIndex}
# Set datetime index
df_train = df_train.set_index(pd.to_datetime(df_train.date))
df_train = df_train.drop("date", axis=1)
print(df_train.head(5))
```
With financial data, it's a good idea to perform CPI adjustment before analysis, to remove the effects of inflation / deflation on our models and predictions. We'll CPI-adjust the sales and oil price columns with 2010 as our base year. The CPI values for Ecuador in this period were retrieved from [here](https://data.worldbank.org/indicator/FP.CPI.TOTL?end=2017&locations=EC&start=2010&view=chart).
- We'll use the yearly CPI values for simplicity, but it's possible to use monthly CPI as well.
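The adjustment follows the standard deflation formula, matching the code below: each nominal value is divided by that year's CPI and rescaled to the base year,
$\text{real sales}_t = \text{nominal sales}_t \times \frac{\text{CPI}_{2010}}{\text{CPI}_{\text{year}(t)}}$
with $\text{CPI}_{2010} = 100$.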
```{python CPIAdjust}
# CPI adjust sales and oil, with CPI 2010 = 100
cpis = {
"2010": 100, "2013": 112.8, "2014": 116.8, "2015": 121.5, "2016": 123.6,
"2017": 124.1
}
for year in [2013, 2014, 2015, 2016, 2017]:
df_train.loc[df_train.index.year == year, "sales"] = df_train.loc[
df_train.index.year == year, "sales"] / cpis[str(year)] * cpis["2010"]
df_train.loc[df_train.index.year == year, "oil"] = df_train.loc[
df_train.index.year == year, "oil"] / cpis[str(year)] * cpis["2010"]
```
We have some columns with missing values in our training data.
```{python NACheck}
# Check missing values in each column
pd.isnull(df_train).sum()
```
We will interpolate the missing values in the oil and transactions columns using time interpolation.
- This performs linear interpolation, but also takes the date-time index of observations into account.
- If we don't set `limit_direction = "both"`, the first value for oil won't be interpolated.
```{python NAInterpolate}
df_train["oil"] = df_train["oil"].interpolate("time", limit_direction = "both")
df_train["transactions"] = df_train["transactions"].interpolate(
"time", limit_direction = "both")
```
We will now aggregate daily sales across all categories and stores, to retrieve our target variable. We have 1684 days of national sales data.
```{python AggSales}
sales = df_train.groupby("date").sales.sum()
sales
```
We will create a Darts [TimeSeries](https://unit8co.github.io/darts/generated_api/darts.timeseries.html) with our target variable.
```{python TSTarget}
from darts import TimeSeries
ts_sales = TimeSeries.from_series(
sales,
freq="D" # Time series frequency is daily
)
print(ts_sales)
```
- Each Pandas series / dataframe column passed is stored as a component in the Darts TS. The date-time index is stored in `.time_index`. We had 1684 rows in our Pandas series, but the Darts TS has 1688 dates. This means our series had some missing dates, which Darts filled in automatically. We'll fill in the values for these dates later.
- To create a multivariate time series, we create a Pandas dataframe with each time series as a column, and a common date-time index. When we pass this dataframe to TimeSeries, we'll have each time series as a component. If the multivariate time series has a **hierarchy**, i.e. if the components sum up together in a certain way, we can map that hierarchy as a dictionary to later perform hierarchical reconciliation (see the sketch after this list). We will explore this further in part 2 of the analysis.
- **Static covariates** are time-invariant covariates that may be used as predictors in global models (models trained on multiple time series at once). They are stored together with the target series in the Darts TS. In our case, the location, type or cluster of a store may be used as static covariates, but for part 1 of our analysis we are looking at national sales, so these aren't applicable.
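For illustration, here is a minimal, hypothetical sketch of a multivariate TimeSeries with a hierarchy. The column names are made up, and the `with_hierarchy` call reflects my reading of the Darts API; none of this is needed for part 1.
```python
# Hypothetical example (not part of this analysis): a multivariate series
# made of two store-level columns plus their total, with a hierarchy dict
import numpy as np
import pandas as pd
from darts import TimeSeries

idx = pd.date_range("2017-01-01", periods=10, freq="D")
df_multi = pd.DataFrame(
    {"store_1": np.random.rand(10), "store_2": np.random.rand(10)}, index=idx)
df_multi["total"] = df_multi["store_1"] + df_multi["store_2"]

# Each dataframe column becomes one component of the multivariate TimeSeries
ts_multi = TimeSeries.from_dataframe(df_multi, freq="D")

# The hierarchy maps each component to the component(s) it sums into;
# with_hierarchy (assumed API) attaches it for later hierarchical reconciliation
ts_multi = ts_multi.with_hierarchy({"store_1": ["total"], "store_2": ["total"]})
print(ts_multi.components)
```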
## Overview of hybrid modeling approach
A time series can be written as the sum of several components:
- **Trend:** The long-term change.
- **Seasonality:** A fluctuation (or several) that repeats based on a fixed, known time period, often related to how we measure time and schedule our lives. For example, the fluctuation of retail store sales across days of a week, or electricity usage across hours of a day.
- **Cyclicality:** A fluctuation that does not repeat based on a fixed, known time period, often caused by temporary factors. For example, the effect of a sharp increase / decrease in oil prices on car sales.
- **Remainder / Error:** The unpredictable component of the time series, at least with the available data and methods.
When analyzing a time series with plots, it can be difficult to determine the nature and causes of fluctuations. It can be especially tricky to tell apart the cyclical effects from repeating seasonality. Because of this, we will split our analysis and modeling into two steps:
- In step 1, we will analyze the time effects: The trend, seasonality and calendar effects (such as holidays & events). We'll build a model that predicts these effects and remove the predictions from the time series, leaving the effects of cyclicality and the unpredictable component. This is called **time decomposition.**
- In step 2, we will re-analyze the time-decomposed time series, this time considering the effects of covariates and lagged values of sales itself as predictors, to try and account for the cyclicality. We'll build a model that uses these predictors, train it on the decomposed sales, and add up the predictions of both models to arrive at our final predictions.
- This approach is called a **hybrid model.** It allows us to differentiate and separately model different time series components. We can also use more than one type of algorithm and combine their strengths.
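As a minimal sketch of this two-step workflow, using synthetic data and placeholder models (the actual features and models used in this report are developed in the following sections):
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic target: a linear trend plus a covariate-driven component plus noise
rng = np.random.default_rng(0)
n = 300
X_time = np.arange(n).reshape(-1, 1)    # stand-in for trend / seasonal features
X_covar = rng.normal(size=(n, 1))       # stand-in for covariates such as oil prices
y = 0.05 * X_time[:, 0] + 2.0 * X_covar[:, 0] + rng.normal(scale=0.1, size=n)

# Step 1: model the time effects and subtract them (time decomposition)
model_1 = LinearRegression().fit(X_time, y)
remainder = y - model_1.predict(X_time)

# Step 2: model the remainder with the covariates
model_2 = LinearRegression().fit(X_covar, remainder)

# Hybrid prediction = sum of both models' predictions
y_pred = model_1.predict(X_time) + model_2.predict(X_covar)
```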
## Exploratory analysis 1: Time & calendar effects
### Trend
Let's start by analyzing the overall trend in sales, with a simple time series plot. Darts offers the ability to plot time series quickly.
```{python TrendPlot}
_ = ts_sales.plot()
_ = plt.ylabel("Daily sales, millions")
plt.show()
plt.close()
```
The time series plot shows us several things:
- Supermarket sales show an increasing trend over the years. The trend is close to linear overall, but the rate of increase declines roughly from the start of 2015. Consider one straight line from 2013 to 2015, and a second, less steep one from 2015 to the end.
- Sales mostly fluctuate around the trend with a repeating pattern, which suggests strong seasonality. However, there are also sharp deviations from the trend in certain periods, mainly across 2014 and at the start of 2015. These are likely cyclical in nature.
- The "waves" of seasonal fluctuations seem to be getting bigger over time. This suggests we should use a multiplicative time decomposition instead of additive.
- Sales decline very sharply on the first day of every year, almost to zero.
### Seasonality
#### Annual seasonality
Let's look at annual seasonality: How sales fluctuate over a year based on quarters, months, weeks of a year and days of a year. In the plots below, we have the daily sales averaged by each respective calendar period, colored by each year in the data. The confidence bands indicate the minimum and maximum daily sales in each respective period (in the last plot, we just have the daily sales without any averaging).
```{python AnnualSeasonality}
# FIG1: Annual seasonality, period averages
fig1, axes1 = plt.subplots(2,2, sharey=True)
_ = fig1.suptitle('Annual seasonality: Average daily sales in given time periods, millions')
# Average sales per quarter of year
_ = sns.lineplot(
ax = axes1[0,0],
x = sales.index.quarter.astype(str),
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales, legend=False)
_ = axes1[0,0].set_xlabel("quarter", fontsize=8)
_ = axes1[0,0].set_ylabel("sales", fontsize=8)
_ = axes1[0,0].tick_params(axis='both', which='major', labelsize=6)
# Average sales per month of year
_ = sns.lineplot(
ax = axes1[0,1],
x = sales.index.month.astype(str),
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales)
_ = axes1[0,1].set_xlabel("month", fontsize=8)
_ = axes1[0,1].set_ylabel("sales",fontsize=8)
_ = axes1[0,1].legend(title = "year", bbox_to_anchor=(1.05, 1.0), fontsize="small", loc='best')
_ = axes1[0,1].tick_params(axis='both', which='major', labelsize=6)
# Average sales per week of year
_ = sns.lineplot(
ax = axes1[1,0],
x = sales.index.week,
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales, legend=False)
_ = axes1[1,0].set_xlabel("week of year", fontsize=8)
_ = axes1[1,0].set_ylabel("sales",fontsize=8)
_ = axes1[1,0].tick_params(axis='both', which='major', labelsize=6)
_ = axes1[1,0].xaxis.set_ticks(np.arange(0, 52, 10))
# Average sales per day of year
_ = sns.lineplot(
ax = axes1[1,1],
x = sales.index.dayofyear,
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales, legend=False)
_ = axes1[1,1].set_xlabel("day of year", fontsize=8)
_ = axes1[1,1].set_ylabel("sales",fontsize=8)
_ = axes1[1,1].tick_params(axis='both', which='major', labelsize=6)
_ = axes1[1,1].xaxis.set_ticks(np.arange(0, 365, 100))
# Show fig1
plt.show()
plt.close("all")
```
- **Quarterly:** Sales do not seem to have a considerable quarterly seasonality pattern. However, the plot still shows us a few things:
- Sales generally slightly increase over a year, but the trend may be stagnating / declining at the end of our data in 2017.
- In Q2 2014, there was a considerable drop. Sales declined almost to the level of Q2 2013. This was likely a cyclical effect.
- **Monthly:** Sales do seem to fluctuate slightly over months, but there's no clear seasonal pattern that's apparent across all years. However, sales seem to sharply increase in November and December every year, likely due to Christmas.
- The cyclicality in 2014 is seen in more detail: Sales dropped almost to their 2013 levels in certain months, and recovered sharply in others.
- We also see a considerable drop in the first half of 2015, where sales dropped roughly to 2014 levels, followed by a recovery.
- There is a sharp increase in April 2016, where sales were as high as 2017 levels. This is due to a large earthquake on April 16, 2016, and the related relief efforts.
- **Weekly:** The seasonal patterns are more visible in the weekly plot, as we see the "waves" of fluctuation line up across years. It's very likely the data has strong weekly seasonality, which is what we'd expect from supermarket sales.
- The data for 2017 ends after August 15, so the sharp decline afterwards is misleading, though there may still be a stagnation / decline in the overall trend.
- The sharp decline at the end of 2016 is also misleading, as 2016 was a 366-day year.
- **Daily:** This plot is a bit noisy, but the very similar fluctuations across all years indicate the data is strongly seasonal. It also highlights some cyclical effects such as the April 2016 earthquake and the 2014 drops.
Another way to look at annual seasonality is to average sales in a certain calendar period across all years, without grouping by year.
```{python AnnualSeasonalityAgg}
# FIG1.1: Annual seasonality, averaged over years
fig11, axes11 = plt.subplots(2,2, sharey=True)
_ = fig11.suptitle('Annual seasonality: Average daily sales in given time periods,\n across all years, millions');
# Average sales per quarter of year
_ = sns.lineplot(
ax = axes11[0,0],
x = sales.index.quarter.astype(str),
y = (sales / 1000000),
data=sales, legend=False)
_ = axes11[0,0].set_xlabel("quarter", fontsize=8)
_ = axes11[0,0].set_ylabel("sales", fontsize=8)
_ = axes11[0,0].tick_params(axis='both', which='major', labelsize=6)
# Average sales per month of year
_ = sns.lineplot(
ax = axes11[0,1],
x = sales.index.month.astype(str),
y = (sales / 1000000),
data=sales)
_ = axes11[0,1].set_xlabel("month", fontsize=8)
_ = axes11[0,1].set_ylabel("sales",fontsize=8)
_ = axes11[0,1].tick_params(axis='both', which='major', labelsize=6)
# Average sales per week of year
_ = sns.lineplot(
ax = axes11[1,0],
x = sales.index.week,
y = (sales / 1000000),
data=sales, legend=False)
_ = axes11[1,0].set_xlabel("week of year", fontsize=8)
_ = axes11[1,0].set_ylabel("sales",fontsize=8)
_ = axes11[1,0].tick_params(axis='both', which='major', labelsize=6)
_ = axes11[1,0].xaxis.set_ticks(np.arange(0, 52, 10))
# Average sales per day of year
_ = sns.lineplot(
ax = axes11[1,1],
x = sales.index.dayofyear,
y = (sales / 1000000),
data=sales, legend=False)
_ = axes11[1,1].set_xlabel("day of year", fontsize=8)
_ = axes11[1,1].set_ylabel("sales",fontsize=8)
_ = axes11[1,1].tick_params(axis='both', which='major', labelsize=6)
_ = axes11[1,1].xaxis.set_ticks(np.arange(0, 365, 100))
# Show fig1.1
plt.show()
plt.close("all")
```
This shows us the "overall" seasonality pattern across all years: We likely have a strong weekly seasonality pattern that holds across the years, and some monthly seasonality especially towards December.
#### Monthly & weekly seasonality
Now let's look at seasonality across days of a month and days of a week. These will likely be the most important seasonality patterns in supermarket sales. There are three ways to look at these plots: First we will group them by year.
```{python MonthlyWeeklyByYear}
# FIG2: Monthly and weekly seasonality
fig2, axes2 = plt.subplots(2)
_ = fig2.suptitle('Monthly and weekly seasonality, average daily sales in millions')
# Average sales per day of month
_ = sns.lineplot(
ax = axes2[0],
x = sales.index.day,
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales)
_ = axes2[0].legend(title = "year", bbox_to_anchor=(1.05, 1.0), fontsize="small", loc='best')
_ = axes2[0].set_xlabel("day of month", fontsize=8)
_ = axes2[0].set_ylabel("sales", fontsize=8)
_ = axes2[0].xaxis.set_ticks(np.arange(1, 32, 6))
_ = axes2[0].xaxis.set_ticks(np.arange(1, 32, 1), minor=True)
_ = axes2[0].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
_ = axes2[0].grid(which='minor', alpha=0.5)
_ = axes2[0].grid(which='major', alpha=1)
# Average sales per day of week
_ = sns.lineplot(
ax = axes2[1],
x = (sales.index.dayofweek+1).astype(str),
y = (sales / 1000000),
hue = sales.index.year.astype(str), data=sales, legend=False)
_ = axes2[1].set_xlabel("day of week", fontsize=8)
_ = axes2[1].set_ylabel("sales", fontsize=8)
_ = axes2[1].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
# Show fig2
plt.show()
plt.close("all")
```
- The weekly seasonality pattern across days of a week is clear. Sales are lowest on Thursdays, increase and peak on Sundays, then drop on Mondays. The pattern holds in all years.
- The monthly seasonality across days of a month isn't as strong, but looks considerable. Sales are generally highest at the start of a month, likely because salaries are paid at the end of the previous month; the competition information also notes that salaries are paid biweekly, in the middle and at the end of each month.
We can look at the same plots grouped by month. The confidence bands are suppressed as they make the plots hard to read.
```{python MonthlyYearlyByMonth}
# FIG2.1: Monthly and weekly seasonality, colored by month
fig21, axes21 = plt.subplots(2)
_ = fig21.suptitle('Monthly and weekly seasonality, average daily sales in millions')
# Average sales per day of month, colored by month
_ = sns.lineplot(
ax = axes21[0],
x = sales.index.day,
y = (sales / 1000000),
hue = sales.index.month.astype(str), data=sales, errorbar=None)
_ = axes21[0].legend(title = "month", bbox_to_anchor=(1.05, 1.0), fontsize="x-small", loc='best')
_ = axes21[0].set_xlabel("day of month", fontsize=8)
_ = axes21[0].set_ylabel("sales", fontsize=8)
_ = axes21[0].xaxis.set_ticks(np.arange(1, 32, 6))
_ = axes21[0].xaxis.set_ticks(np.arange(1, 32, 1), minor=True)
_ = axes21[0].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
_ = axes21[0].grid(which='minor', alpha=0.5)
_ = axes21[0].grid(which='major', alpha=1)
# Average sales per day of week, colored by month
_ = sns.lineplot(
ax = axes21[1],
x = (sales.index.dayofweek+1).astype(str),
y = (sales / 1000000),
hue = sales.index.month.astype(str), data=sales, errorbar=None, legend=None)
_ = axes21[1].set_xlabel("day of week", fontsize=8)
_ = axes21[1].set_ylabel("sales", fontsize=8)
_ = axes21[1].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
# Show fig2.1
plt.show()
plt.close("all")
```
This plot shows us the monthly and weekly seasonality pattern generally holds across all months, but December is a notable exception:
- Sales in December are considerably higher roughly from the 13th onwards, due to Christmas approaching. The Christmas peak seems to happen on the 23rd, followed by a decline and another sharp increase on the 30th.
- The sales by day of week are also higher in December for every day, but the pattern of the weekly seasonality is the same.
- We can also see that sales decline very sharply on January 1st, almost to zero, and make a sharp recovery on the 2nd.
And finally, without any grouping: The averages across all years and months.
```{python MonthlyYearly}
# FIG2.2: Monthly and weekly seasonality, average across years
fig22, axes22 = plt.subplots(2)
_ = fig22.suptitle('Monthly and weekly seasonality, average daily sales in millions')
# Average sales per day of month
_ = sns.lineplot(
ax = axes22[0],
x = sales.index.day,
y = (sales / 1000000),
data=sales)
_ = axes22[0].set_xlabel("day of month", fontsize=8)
_ = axes22[0].set_ylabel("sales", fontsize=8)
_ = axes22[0].xaxis.set_ticks(np.arange(1, 32, 6))
_ = axes22[0].xaxis.set_ticks(np.arange(1, 32, 1), minor=True)
_ = axes22[0].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
_ = axes22[0].grid(which='minor', alpha=0.5)
_ = axes22[0].grid(which='major', alpha=1)
# Average sales per day of week
_ = sns.lineplot(
ax = axes22[1],
x = (sales.index.dayofweek+1).astype(str),
y = (sales / 1000000),
data=sales)
_ = axes22[1].set_xlabel("day of week", fontsize=8)
_ = axes22[1].set_ylabel("sales", fontsize=8)
_ = axes22[1].yaxis.set_ticks(np.arange(0, 1.25, 0.25))
# Show fig22
plt.show()
plt.close("all")
```
This plot allows us to see the overall monthly seasonality pattern more clearly: Sales are higher at the start of a month, slightly decline until the mid-month payday, slightly increase afterwards, and start peaking again towards the end of a month.
### Autocorrelation & partial autocorrelation
Autocorrelation is the correlation of a variable with its own past values. Partial autocorrelation can be thought of as the marginal contribution of each lagged value to autocorrelation, since the lags are likely to hold common information. The patterns in the autocorrelation and partial autocorrelation plots can give us insight into the seasonal patterns.
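For reference, the lag-$k$ sample autocorrelation plotted below is computed as
$r_k = \frac{\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T}(y_t - \bar{y})^2}$
where $T$ is the series length and $\bar{y}$ is the sample mean.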
```{python AcfPacfTime}
# FIG3: ACF and PACF plots
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig3, axes3 = plt.subplots(2)
_ = fig3.suptitle('Autocorrelation and partial autocorrelation, daily sales, up to 54 days')
_ = plot_acf(sales, lags=range(0,55), ax=axes3[0])
_ = plot_pacf(sales, lags=range(0,55), ax=axes3[1], method="ywm")
# Show fig3
plt.show()
plt.close("all")
```
The sinusoidal pattern in the ACF plot is typical for strong weekly seasonality:
- Sales at T=0 are highly correlated with the sales of the previous day at T-1.
- The correlation declines until T-6, then spikes again at T-7, which is the same day of the week as T=0, one week earlier.
- The pattern repeats weekly after T-7, with declining strength.
The PACF plot shows the marginal contribution of each lag to the relationship with the present day value. The partial autocorrelation is highest for lag 1, and significant correlations die out roughly after 14 days. In step 2 of this analysis, we will revisit this plot to derive features from sales lags, after performing time decomposition with model 1.
### April 2016 Earthquake
One more time effect we will look at is the April 2016 earthquake, a one-off occurrence we need to adjust for in our model. The plots below show the seasonality of sales in April and May, in every year. The black dashed vertical line marks April 16.
```{python EarthquakePlot}
# FIG6: Zoom in on earthquake: 16 April 2016
april_sales = sales.loc[sales.index.month == 4]
may_sales = sales.loc[sales.index.month == 5]
fig6, axes6 = plt.subplots(2, sharex=True)
_ = fig6.suptitle("Effect of 16 April 2016 earthquake on sales")
# April
_ = sns.lineplot(
ax = axes6[0],
x = april_sales.index.day,
y = april_sales,
hue = april_sales.index.year.astype(str),
data = april_sales
)
_ = axes6[0].set_title("April")
_ = axes6[0].legend(title = "year", bbox_to_anchor=(1.05, 1.0), fontsize="small", loc='best')
_ = axes6[0].set_xlabel("days of month")
_ = axes6[0].set_xticks(np.arange(1, 32, 6))
_ = axes6[0].set_xticks(np.arange(1, 32, 1), minor=True)
_ = axes6[0].grid(which='minor', alpha=0.5)
_ = axes6[0].grid(which='major', alpha=1)
_ = axes6[0].axvline(x = 16, color = "black", linestyle = "dashed")
# May
_ = sns.lineplot(
ax = axes6[1],
x = may_sales.index.day,
y = may_sales,
hue = may_sales.index.year.astype(str),
data = may_sales, legend=False
)
_ = axes6[1].set_title("May")
_ = axes6[1].set_xlabel("days of month")
_ = axes6[1].set_xticks(np.arange(1, 32, 6))
_ = axes6[1].set_xticks(np.arange(1, 32, 1), minor=True)
_ = axes6[1].grid(which='minor', alpha=0.5)
_ = axes6[1].grid(which='major', alpha=1)
# Show FIG6
plt.show()
plt.close("all")
```
With the earthquake, sales sharply increased above the expected seasonal pattern and trend, peaked on April 18, then declined and returned to seasonal expectations around April 22. It doesn't seem like sales were higher than normal for the rest of April 2016, or in May 2016. We'll engineer our earthquake feature accordingly.
## Feature engineering: Time & calendar features
Our main candidate for the time decomposition model is a simple linear regression, so we will focus our feature engineering towards that.
- With linear regression, we can model and extrapolate trends and seasonality patterns, as well as adjust for one-off calendar effects.
- Methods that can't extrapolate outside the training values are not suitable for modeling time effects, even if they are more "advanced".
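To illustrate the second point with a small synthetic example (the tree model here is just a stand-in for any method that cannot extrapolate beyond its training values):
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# A perfectly linear upward trend over t = 1..100
t_train = np.arange(1, 101).reshape(-1, 1)
y_train = 2.0 * t_train[:, 0] + 5.0

# Time steps beyond the training range
t_future = np.array([[110], [120], [130]])

linear = LinearRegression().fit(t_train, y_train)
tree = DecisionTreeRegressor().fit(t_train, y_train)

print(linear.predict(t_future))  # keeps following the trend: 225, 245, 265
print(tree.predict(t_future))    # flat at the last training level: 205, 205, 205
```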
### Calendar effects & weekly seasonality features
We'll flag the dates that influence sales considerably with binary or ordinal features. These columns will take non-zero values if a certain calendar effect is present in a date, and zero values otherwise.
Sales decline very sharply on January 1st of every year, almost to zero. Sales on January 2nd then recover sharply, which is a huge relative increase for one day, so we will flag both dates with dummy features.
```{python FeatEngNY}
# New year's day features
df_train["ny1"] = ((df_train.index.day == 1) & (df_train.index.month == 1)).astype(int)
# Set holiday dummies to 0 if NY dummies are 1
df_train.loc[df_train["ny1"] == 1, ["local_holiday", "regional_holiday", "national_holiday"]] = 0
df_train["ny2"] = ((df_train.index.day == 2) & (df_train.index.month == 1)).astype(int)
df_train.loc[df_train["ny2"] == 1, ["local_holiday", "regional_holiday", "national_holiday"]] = 0
```
We have considerably higher sales in December due to Christmas and New Year's Eve.
- For NY's eve effects, we can simply flag December 30 and 31st with dummy features.
- For Christmas effects, we will create two integer columns that reflect the "strength" of the Christmas effect on sales, based on the day of month. The first column will reach its maximum value on December 23rd, and the second will decline as the date moves further past the 23rd. For dates outside December 13-27, these features will take a value of zero. This may not be the most sophisticated way to reflect the Christmas effects in our model, but it should be simple and effective.
```{python FeatEngDecember}
# NY's eve features
df_train["ny_eve31"] = ((df_train.index.day == 31) & (df_train.index.month == 12)).astype(int)
df_train["ny_eve30"] = ((df_train.index.day == 30) & (df_train.index.month == 12)).astype(int)
df_train.loc[(df_train["ny_eve31"] == 1) | (df_train["ny_eve30"] == 1), ["local_holiday", "regional_holiday", "national_holiday"]] = 0
# Proximity to Christmas sales peak
df_train["xmas_before"] = 0
df_train.loc[
(df_train.index.day.isin(range(13,24))) & (df_train.index.month == 12), "xmas_before"] = df_train.loc[
(df_train.index.day.isin(range(13,24))) & (df_train.index.month == 12)].index.day - 12
df_train["xmas_after"] = 0
df_train.loc[
(df_train.index.day.isin(range(24,28))) & (df_train.index.month == 12), "xmas_after"] = abs(df_train.loc[
(df_train.index.day.isin(range(24,28))) & (df_train.index.month == 12)].index.day - 27)
df_train.loc[(df_train["xmas_before"] != 0) | (df_train["xmas_after"] != 0), ["local_holiday", "regional_holiday", "national_holiday"]] = 0
```
To account for the effect of the April 2016 earthquake, we create a feature similar to the Christmas ones. The value peaks on April 18 and is zero for dates outside April 17-22, 2016.
```{python FeatEngQuake}
# Strength of earthquake effect on sales
# April 18 > 17 > 19 > 20 > 21 > 22
df_train["quake_after"] = 0
df_train.loc[df_train.index == "2016-04-18", "quake_after"] = 6
df_train.loc[df_train.index == "2016-04-17", "quake_after"] = 5
df_train.loc[df_train.index == "2016-04-19", "quake_after"] = 4
df_train.loc[df_train.index == "2016-04-20", "quake_after"] = 3
df_train.loc[df_train.index == "2016-04-21", "quake_after"] = 2
df_train.loc[df_train.index == "2016-04-22", "quake_after"] = 1
```
We previously created binary features to indicate local-regional-national holidays, and special events. There are only a few different events in the dataset, and they differ in nature, so we will break up the events column and create separate binary features for each event. We already created a feature for the earthquake, so we'll skip that one.
```{python FeatEngEvents}
# Split events, delete events column
df_train["dia_madre"] = ((df_train["event"] == 1) & (df_train.index.month == 5) & (df_train.index.day.isin([8,10,11,12,14]))).astype(int)
df_train["futbol"] = ((df_train["event"] == 1) & (df_train.index.isin(pd.date_range(start = "2014-06-12", end = "2014-07-13")))).astype(int)
df_train["black_friday"] = ((df_train["event"] == 1) & (df_train.index.isin(["2014-11-28", "2015-11-27", "2016-11-25"]))).astype(int)
df_train["cyber_monday"] = ((df_train["event"] == 1) & (df_train.index.isin(["2014-12-01", "2015-11-30", "2016-11-28"]))).astype(int)
df_train = df_train.drop("event", axis=1)
```
Holidays and events may increase sales in advance, so we will create one lead column (the flag value of the following day) for each holiday column and for the Mother's Day event. These features could be tailored more carefully to each holiday and event, but we'll keep it simple.
```{python FeatEngLeads}
# Holiday-event leads
df_train["local_lead1"] = df_train["local_holiday"].shift(-1).fillna(0)
df_train["regional_lead1"] = df_train["regional_holiday"].shift(-1).fillna(0)
df_train["national_lead1"] = df_train["national_holiday"].shift(-1).fillna(0)
df_train["diamadre_lead1"] = df_train["dia_madre"].shift(-1).fillna(0)
```
To capture the weekly seasonality in a simple manner, we'll create 6 dummy features for days of the week. We won't create one for Monday, since a 0 value for the other 6 columns means Monday.
```{python FeatEngDaysWeek}
# Days of week dummies
df_train["tuesday"] = (df_train.index.dayofweek == 1).astype(int)
df_train["wednesday"] = (df_train.index.dayofweek == 2).astype(int)
df_train["thursday"] = (df_train.index.dayofweek == 3).astype(int)
df_train["friday"] = (df_train.index.dayofweek == 4).astype(int)
df_train["saturday"] = (df_train.index.dayofweek == 5).astype(int)
df_train["sunday"] = (df_train.index.dayofweek == 6).astype(int)
```
Now we will aggregate the time features by mean. For local and regional holidays, this will give us fractional values between 0 and 1, which is likely a decent way to reflect local and regional holidays' effects on national sales.
```{python FeatEngTimeAgg}
# Aggregate time features by mean
time_covars = df_train.drop(
columns=['id', 'store_nbr', 'family', 'sales', 'onpromotion', 'transactions', 'oil', 'city', 'state', 'store_type', 'store_cluster'], axis=1).groupby("date").mean(numeric_only=True)
```
### Trend & monthly seasonality features
In the exploratory analysis, we saw the overall trend can be modeled linearly, but the rate of increase (the slope of the trend line) declines from the start of 2015. To capture this, we will model a piecewise linear trend with one knot at 01-01-2015.
- The slope of the linear trend will change and decline once, at the knot date. In effect, the result will be two linear trend lines added together.
- It's possible to add more "turns" to a linear trend with more knots, but one is enough in our case. We want the trend to be simple, and robust against seasonal or cyclical fluctuations.
```{python FeatEngTrend1}
# Add piecewise linear trend dummies
time_covars["trend"] = range(1, 1685) # Linear dummy 1
# Knot to be put at period 729
print(time_covars.loc[time_covars.index=="2015-01-01"]["trend"])
```
```{python FeatEngTrend2}
# Add second linear trend dummy
time_covars["trend_knot"] = 0
time_covars.iloc[728:,-1] = range(0, 956)
# Check start and end of knot
print(time_covars.loc[time_covars["trend"]>=729][["trend", "trend_knot"]])
```
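To see why these two columns yield a piecewise linear trend: the knot column is zero up to the knot date and then increases by one per day, so a linear model fitted on both columns takes the form
$\text{trend effect}_t = b_1 \cdot \text{trend}_t + b_2 \cdot \text{knot}_t$
which has slope $b_1$ before the knot and slope $b_1 + b_2$ afterwards. A negative $b_2$ estimate produces the flattening we observed after the start of 2015.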
For the monthly seasonality, we will create Fourier features. This will capture the slight increases and decreases in sales throughout a month, mostly due to proximity to paydays.
- Named after the French mathematician Joseph Fourier, Fourier series can be used to model any periodic, repeating fluctuation / waveform as a sum of sine-cosine pairs.
- We'll create 5 Fourier pairs (10 columns in total) for 28-period seasonality.
- Too few pairs may not capture the fluctuations well, while too many run the risk of overfitting.
- December was an exception to the monthly seasonality pattern, but our Christmas and NY features will adjust for that.
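Concretely, the $k$-th Fourier pair mentioned above consists of the two columns
$\sin\left(\frac{2 \pi k t}{28}\right), \quad \cos\left(\frac{2 \pi k t}{28}\right), \qquad k = 1, \dots, 5$
where $t$ is the integer time index. The regression estimates one coefficient per column, and the weighted sum of these waves approximates the within-month seasonal shape.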
```{python FeatEngFourier}
from statsmodels.tsa.deterministic import DeterministicProcess
# Add Fourier features for monthly seasonality
dp = DeterministicProcess(
index = time_covars.index,
constant = False,
order = 0, # No trend feature
seasonal = False, # No seasonal dummy features
period = 28, # 28-period seasonality (28 days, 1 month)
fourier = 5, # 5 Fourier pairs
drop = True # Drop perfectly collinear terms
)
time_covars = time_covars.merge(dp.in_sample(), how="left", on="date")
# View Fourier features
print(time_covars.iloc[0:5, -10:])
```
## Model 1: Time effects decomposition
We will now build our time decomposition linear regression model in Darts, and compare it with some baseline models, as well as some models that are quicker to implement.
### Preprocessing
```{python TSTimeCovars}
# Create Darts time series with time features
ts_timecovars = TimeSeries.from_dataframe(
time_covars, freq = "D", fill_missing_dates = False)
```
Our target and covariate Series had 1684 rows each, but the Darts TimeSeries we create from them have 1688 dates. This is likely because we have gaps in our original series. We can check this easily in Darts.
```{python TSGaps}
# Scan for gaps
print(ts_timecovars.gaps())
```
It seems our data is missing values for December 25th in every year (the data for 2017 ends in August). Darts automatically filled in the missing dates, bringing the total to 1688, but we still need to fill in the missing values in our target and covariate series.
```{python FillGaps}
# Fill gaps by interpolating missing values
from darts.dataprocessing.transformers import MissingValuesFiller
na_filler = MissingValuesFiller()
ts_sales = na_filler.transform(ts_sales)
ts_timecovars = na_filler.transform(ts_timecovars)
# Scan for gaps again
print(ts_timecovars.gaps())
```
Our exploratory analysis showed that the seasonal fluctuations in sales become larger over time. We should use a multiplicative decomposition instead of additive.
- In **additive decomposition**, we decompose a time series as **Total =** **Trend + Seasonality + Remainder.**
- In **multiplicative decomposition**, we decompose a time series as **Total =** **Trend \* Seasonality \* Remainder.**
One way of performing multiplicative decomposition is taking the logarithm of the time series and performing additive decomposition, because if
$Y = T \times S \times R$,
then
$\log(Y) = \log(T) + \log(S) + \log(R)$.
```{python LogExpTrafos}
# Define functions to perform log transformation and reverse it. +1 to avoid zeroes
def trafo_log(x):
return x.map(lambda x: np.log(x+1))
def trafo_exp(x):
return x.map(lambda x: np.exp(x)-1)
```
We'll train our time decomposition models on the natural logarithm of the daily sales data from 2013-2016, and validate their prediction performance on 2017. We use an entire year of data for validation to see if our features can capture long-term seasonality and fixed calendar effects.
```{python TrainValSplit1}
# Train-validation split: Pre 2017 vs 2017
y_train1, y_val1 = trafo_log(ts_sales[:-227]), trafo_log(ts_sales[-227:])
```
### Model specification
We'll compare the performance of our linear regression model against a few other models.
Naive drift and naive seasonal are two baseline models, meant to represent the performance of a very simple prediction.
- Naive drift fits a straight line from the start to the end of the training data, and extrapolates it to make predictions.
- Naive seasonal simply repeats the last K values in the training set.
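A minimal sketch of how these baselines can be instantiated in Darts (the choice of K=7 here, repeating the last observed week, is an assumption for illustration):
```python
from darts.models import NaiveDrift, NaiveSeasonal

# Baseline 1: a straight line from the first to the last training value, extrapolated
naive_drift = NaiveDrift()
naive_drift.fit(y_train1)
pred_drift = naive_drift.predict(n=len(y_val1))

# Baseline 2: repeat the last K values of the training set (K=7: the last week)
naive_seasonal = NaiveSeasonal(K=7)
naive_seasonal.fit(y_train1)
pred_seasonal = naive_seasonal.predict(n=len(y_val1))
```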
We'll also test two seasonal models which require little input. These models cannot use other covariates in their predictions, such as our calendar features.