---
title: "Customer Segmentation for Targeted Marketing"
author: "Akshit Jain"
date: "5/12/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(readr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyverse)
library(DataExplorer)
library(lubridate)
library(agricolae)
library(sf)
library(raster)
library(spData)
library(tm)
library(tmap)
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(FactoMineR)
library(wordcloud)
library(httr)
library(xml2)
library(rvest)
library(doParallel)
library(fmsb)
library(scales)
library(rpart)
library(rpart.plot)
library(kableExtra)
```
```{r image1, echo=FALSE, out.width = '100%'}
knitr::include_graphics("customer-seg-img.jpg")
```
# Introduction
Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.
From knowing which products to buy, how many and when, to marketing the right products to the right customers at the right time, there are plenty of uses for data in retail - from the biggest multinationals to the smallest single-outlet shop. By analysing customer purchase and product sales history, we can group customers and products into segments that behave similarly, and make data-driven business decisions that can improve a wide range of inventory and sales key performance indicators (KPIs).
The data for this analysis comes from the UCI Machine Learning Repository and represents transactional data from a UK retailer from 2010 to 2011. It mostly represents sales to wholesalers, so it differs slightly from consumer purchase patterns, but it is still a useful case study.
**Note:** To access the code for this project, kindly check out the GitHub repository linked below.
# Scope
1. Analyze the sales trends, market profitability, order cancellations and product categories through exploratory data analysis.
2. Leverage the product descriptions to better understand what product categories interest our customers the most.
3. Implement K-means and Hierarchical clustering algorithms to segment customers in order to gain insight into shopping behaviors, analyze product affinity, measure marketing effectiveness, and better allocate future marketing spend.
# Data Preparation
Let's start by loading the dataset and getting a feel for its size and the class of each variable:
```{r prepare-data-1, echo=FALSE, warning=FALSE, results='asis'}
initial_df = read_csv("data.csv", col_types = cols())
initial_df$CustomerID = as.character(initial_df$CustomerID)
kable(initial_df[1:5, ], caption = "A glimpse of the dataset") %>% kable_styling()
```
The shape of the dataframe is: `r dim(initial_df)`
```{r prepare-data-1.1, echo=FALSE, warning=FALSE}
options(repr.plot.width=8, repr.plot.height=3)
# look for missing values using the DataExplorer package
plot_missing(initial_df,
geom_label_args = list("size" = 3, "label.padding" = unit(0.1, "lines")),
ggtheme = theme_minimal())
```
Looking at the size of the dataset and the missing-value plot, it is interesting to note that ~25% of the entries are not assigned to a particular customer. Even after removing these rows, we still have a good-sized dataset to work with. Moreover, with the data available it is impossible to impute values for the missing customers, so these entries are useless for our analysis. Let's start by removing them:
```{r prepare-data-1.2, echo=FALSE, warning=FALSE}
initial_df = na.omit(initial_df)
options(repr.plot.width=8, repr.plot.height=3)
# look for missing values using the DataExplorer package
plot_missing(initial_df,
geom_label_args = list("size" = 3, "label.padding" = unit(0.1, "lines")),
ggtheme = theme_minimal())
```
The shape of the dataframe after removing NA values: `r dim(initial_df)`
# Feature Engineering
Variables that stand out are <b>InvoiceDate</b>, <b>Quantity</b> and <b>UnitPrice</b>.
- <b>InvoiceDate</b> is a character variable, but we can pull out the date and time information to create two new variables. We'll also create separate variables for month, year and hour of day.
- <b>Quantity</b> and <b>UnitPrice</b> will be multiplied to create a new column, <b>BasketPrice</b>.
```{r prepare-data-1.3, echo=FALSE, warning=FALSE, results='asis'}
initial_df = separate(initial_df, col = c("InvoiceDate"),
into = c("InvoiceDate", "InvoiceTime"), sep = " ")
initial_df = separate(initial_df, col = c("InvoiceDate"),
into = c("Month", "Day", "Year"), sep = "/",
remove = FALSE)
initial_df = initial_df %>% dplyr::select(-Day)
initial_df = separate(initial_df, col = c("InvoiceTime"),
into = c("HourOfDay", "Minutes"), sep = ":",
remove = FALSE)
initial_df = initial_df %>% dplyr::select(-Minutes)
initial_df$InvoiceDate = as.Date(initial_df$InvoiceDate, "%m/%d/%Y")
initial_df$DayOfWeek = wday(initial_df$InvoiceDate, label = TRUE)
initial_df = initial_df %>% mutate(BasketPrice = Quantity * UnitPrice)
# Finally, I check for duplicate entries and delete them:
initial_df = dplyr::distinct(initial_df)
initial_df$Country <- as.factor(initial_df$Country)
initial_df$Month<- as.factor(initial_df$Month)
initial_df$Year <- as.factor(initial_df$Year)
levels(initial_df$Year) <- c(2010,2011)
initial_df$HourOfDay<- as.factor(initial_df$HourOfDay)
initial_df$DayOfWeek <- as.factor(initial_df$DayOfWeek)
kable(initial_df[1:5, ], caption = "Dataset with new features added") %>%
kable_styling(font_size = 8)
```
We now have a good dataframe for exploring and analyzing the sales trends, market profitability, order cancellations and product categories. Before we move on to extracting product categories and performing customer segmentation, we'll look at some of the bigger features of the dataset.
# Exploratory Data Analysis
This dataframe contains 8 features + 6 engineered features = 14 features that correspond to:
- <b>InvoiceNo:</b> Invoice number. *Nominal*, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation.
- <b>StockCode:</b> Product (item) code. *Nominal*, a 5-digit integral number uniquely assigned to each distinct product.
- <b>Description:</b> Product (item) name. *Nominal*.
- <b>Quantity:</b> The quantities of each product (item) per transaction. Numeric.
- <b>InvoiceDate:</b> Invoice date and time. *Date*, the day and time when each transaction was generated.
- <b>UnitPrice:</b> Unit price. *Numeric*, product price per unit in sterling.
- <b>CustomerID:</b> Customer number. *Nominal*, a 5-digit integral number uniquely assigned to each customer.
- <b>Country:</b> Country name. *Nominal*, the name of the country where each customer resides.
Summary of engineered features: <b>Month</b>, <b>Year</b>, <b>InvoiceTime</b>, <b>HourOfDay</b>, <b>DayOfWeek</b>, <b>BasketPrice</b>
### Revenue By Date
```{r explore-data-1, echo=FALSE, warning=FALSE}
initial_df %>%
group_by(InvoiceDate) %>% summarise(Revenue = sum(BasketPrice)) %>%
ggplot(aes(x = InvoiceDate, y = Revenue)) +
geom_line() +
geom_smooth(formula = y~x, method = "loess", se = TRUE) +
labs(x = "Date", y = "Revenue (£)", title = "Sales Revenue by Date")
```
It appears as though sales are trending up, which is a good sign, but that doesn't generate any actionable insight on its own, so let's dig into the data a bit further.
### Day of Week Analysis
Using the *lubridate* package, we assigned a day of the week to each date in our dataset. Generally, people tend to be in a different frame of mind as the week goes on. Are people more likely to spend as the week goes on? Browsing to pass a Sunday afternoon? Procrastinating on that Friday afternoon at work? Cheering yourself up after a difficult Monday? Also, since a lot of our customers are wholesale buyers, do they fill up their inventories on a regular basis? Is there a pattern in their purchasing history?
Let's drill into the days of the week side of our data and see what we can uncover about our sales trends.
```{r explore-data-1.1, echo=FALSE, warning=FALSE}
initial_df %>%
group_by(DayOfWeek) %>% summarise(Revenue = sum(BasketPrice)) %>%
ggplot(aes(x = DayOfWeek, y = Revenue)) +
geom_bar(stat = "identity", fill = 'steelblue') +
labs(x = "Day of Week", y = "Revenue (£)", title = "Sales Revenue by Day of Week")
```
It looks like there could be something interesting going on with the amount of revenue that is generated on each particular weekday. What about Saturday? Let's drill into this a little bit more by creating a new dataframe that we can use to look at what's going on at the day of the week level in a bit more detail:
```{r explore-data-1.2, echo=FALSE, warning=FALSE, results='asis'}
weekday_summary = initial_df %>%
group_by(InvoiceDate, DayOfWeek) %>%
summarise(Revenue = sum(BasketPrice), Transactions = n_distinct(InvoiceNo)) %>%
mutate(AverageOrderVal = round((Revenue/ Transactions), 2)) %>%
ungroup()
kable(weekday_summary[1:5, ], caption = "Summary of Weekday Transactions") %>%
kable_styling()
```
We now have a dataframe that summarises what is happening on each day, with *DayOfWeek* alongside a few newly engineered variables (daily *Revenue*, *Transactions* and *AverageOrderVal*), so we can drill into the data more thoroughly.
```{r explore-data-1.3, echo=FALSE, warning=FALSE}
weekday_summary %>%
ggplot(aes(x = DayOfWeek, y = Revenue)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Revenue (£)", title = "Sales Revenue by Day of Week")
```
```{r explore-data-1.4, echo=FALSE, warning=FALSE}
weekday_summary %>%
ggplot(aes(x = DayOfWeek, y = Transactions)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Transactions", title = "Number of Transactions by Day of Week")
```
```{r explore-data-1.5, echo=FALSE, warning=FALSE}
weekday_summary %>%
ggplot(aes(x = DayOfWeek, y = AverageOrderVal)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Average Order Value (£)",
title = "Number of Transactions by Day of Week")
```
Eye-balling the plots, it looks as though there are differences in the amount of revenue generated on each day of the week, and that this difference is driven by the number of transactions rather than the average order value. Notably, there are no transactions on Saturdays; the retailer might not be accepting orders that day.
Let's plot the data as a density plot to get a better feel for how the data is distributed across the days.
```{r explore-data-1.6, echo=FALSE, warning=FALSE}
weekday_summary %>%
ggplot(aes(Transactions, fill = DayOfWeek)) +
geom_density(alpha = 0.2)
```
There appears to be a reasonable amount of skewness in our distributions, so we'll use a non-parametric test to look for statistically significant differences in our data.
```{r explore-data-1.7, echo=FALSE, warning=FALSE}
kruskal.test(weekday_summary$Transactions ~ weekday_summary$DayOfWeek, data = weekday_summary)
```
The null hypothesis of the Kruskal–Wallis test is that the mean ranks of the groups are the same; the alternative is that at least one group differs.
The p-value obtained from the test is very small, hence we reject the null hypothesis and conclude that the mean ranks of the groups are significantly different.
```{r explore-data-1.8, echo=FALSE, warning=FALSE}
posthoc = kruskal(weekday_summary$Transactions, weekday_summary$DayOfWeek, console = FALSE)
posthoc$groups # compact letter display: days sharing a letter are not significantly different
```
#### Conclusions from Day of Week Analysis
Analyzing the data at the weekday level, we can observe statistically significant differences in the number of transactions on different days of the week, with Sunday having the lowest number and Thursday the highest. As the average order value remains relatively constant, the number of transactions explains the difference in revenue.
Given the low number of transactions on a Sunday and a high number on a Thursday, we could make recommendations around our digital advertising spend. Should we spend less on a Sunday and more on a Thursday, given that we know we already have more transactions, which could suggest people are more ready to buy on Thursdays? Possible, but without knowing other key metrics, it might be a bit hasty to say.
While this data does reveal insight, in order to be truly actionable, we would want to combine this with more information. In particular, combining these data with web analytics data would be hugely valuable. How do these data correlate with web traffic figures? Does the conversion rate change or is there just more traffic on a Thursday and less on a Sunday?
What about our current advertising spend? Is the company already spending less on a Sunday and more on a Thursday, and is that behind our observed differences? What about buying cycles? How long does it take for a customer to go from thinking about buying something to buying it? If it's usually a couple of days, should we advertise more on a Tuesday? Should we continue with an increased spend on a Thursday, when they're ready to buy, and let our competitors pay for the clicks while the customer is in the 'research' stage of the process?
These types of questions illustrate the importance of understanding the vertical, the business model and other factors and decisions which underpin the dataset, rather than just looking at the dataset in isolation.
### Hour of Day Analysis
In a similar way to the day-of-the-week analysis, is there insight to be had from looking at the hours of the day?
```{r explore-data-1.9, echo=FALSE, warning=FALSE}
initial_df %>%
group_by(HourOfDay) %>%
summarise(Revenue = sum(BasketPrice)) %>%
ggplot(aes(x = HourOfDay, y = Revenue)) +
geom_bar(stat = "identity", fill = 'steelblue') +
labs(x = "Hour of Day", y = "Revenue(£)", title = "Revenue by Hour of Day")
```
```{r explore-data-2.0, echo=FALSE, warning=FALSE}
initial_df %>%
group_by(HourOfDay) %>%
summarise(Transactions = n_distinct(InvoiceNo)) %>%
ggplot(aes(x = HourOfDay, y = Transactions)) +
geom_bar(stat = "identity", fill = 'steelblue') +
labs(x = "Hour of Day", y = "Transactions", title = "Transactions by Hour of Day")
```
#### Conclusions from Hour of Day Analysis
It certainly seems as though there is something going on here. We have more transactions and more revenue from the morning to mid-afternoon, tailing off quickly towards the early evening. There are also some hours missing, which would need looking into: are there genuinely no transactions during those times, or is something else at work?
Examining the transactions in the dataset confirms a strong presence of wholesalers, so it makes sense to observe more transactions and higher revenue during usual working hours.
### Market Profitability Analysis
Our e-commerce retailer ships to a number of countries around the world. Let's drill into the data from that perspective and see what we can find out.
```{r explore-data-2.1, echo=FALSE, warning=FALSE, message=FALSE}
customers_world =
left_join(world, initial_df, by=c("name_long" = "Country"))
world_df = customers_world %>% dplyr::select(iso_a2, name_long, InvoiceNo) %>%
na.omit() %>%
group_by(name_long) %>% summarise(Transactions = n_distinct(InvoiceNo))
tmap_mode("view")
tm_shape(world_df) +
tm_polygons("Transactions", breaks = c(0, 10, 100, 500, 1000, 20000))
```
We see that the dataset is largely dominated by orders made from the UK.
```{r explore-data-2.2, echo=FALSE, warning=FALSE, results='asis'}
country_summary = initial_df %>%
group_by(Country) %>%
summarise(Revenue = sum(BasketPrice), Transactions = n_distinct(InvoiceNo)) %>%
mutate(AverageOrderVal = round((Revenue/ Transactions), 2)) %>%
arrange(desc(Revenue)) %>%
ungroup()
kable(country_summary[1:10, ], caption = "Transaction summary across different countries") %>%
kable_styling()
```
```{r explore-data-2.3, echo=FALSE, warning=FALSE, results='asis'}
country_customer_summary = initial_df %>%
group_by(Country) %>%
summarise(Revenue = sum(BasketPrice), Customers = n_distinct(CustomerID)) %>%
mutate(AverageCustomerSpend = round((Revenue/ Customers), 2)) %>%
arrange(desc(Revenue)) %>%
ungroup()
kable(country_customer_summary [1:10, ], caption = "Customer summary across different countries") %>%
kable_styling()
```
Plenty to see there, certainly: a lot of different countries contributing a good amount of revenue. As refunds and/or cancellations appear in the dataset as negative revenue, we can assume the revenue figures here are net of refunds, which is important to consider when shipping goods overseas. However, additional information on the costs incurred dealing with these refunds would allow us to make more appropriate recommendations.
Let's begin by looking at the top five countries in terms of revenue contribution. We'll exclude the UK as we are based in the UK, so improving UK performance will undoubtedly already be on the radar.
```{r explore-data-2.4, echo=FALSE, warning=FALSE}
top5_countries = initial_df %>%
filter(Country == 'Netherlands' |
Country == 'EIRE' |
Country == 'Germany' |
Country == 'France' |
Country == 'Australia')
top5_countries_summaries =
top5_countries %>%
group_by(Country, InvoiceDate) %>%
summarise(Revenue = sum(BasketPrice),
Transactions = n_distinct(InvoiceNo),
Customers = n_distinct(CustomerID)) %>%
mutate(AverageOrderVal = round((Revenue/ Transactions), 2)) %>%
arrange(InvoiceDate) %>%
ungroup()
```
Looking at the top five non-UK countries by revenue, the lowest number of transactions is 69 (Australia), which, given the time period of the dataset, is still a regular stream of transactions, so the inclusion of these countries seems justified.
```{r explore-data-2.5, echo=FALSE, warning=FALSE}
top5_countries_summaries %>%
ggplot(aes(x = Country, y = Revenue)) +
stat_summary(fun = sum, geom = "bar", fill = "steelblue", colour = "black") +
labs(x = "Country", y = "Revenue(£)", title = "Revenue by Country")
```
```{r explore-data-2.6, echo=FALSE, warning=FALSE}
top5_countries_summaries %>%
ggplot(aes(x = InvoiceDate, y = Revenue, color = Country)) +
geom_smooth(formula = y~x, method = "loess", se = FALSE) +
labs(x = "Date", y = "Revenue(£)", title = "Year-on-Year Sales Revenue by Country")
```
```{r explore-data-2.7, echo=FALSE, warning=FALSE}
top5_countries_summaries %>%
ggplot(aes(x = Country, y = AverageOrderVal)) +
geom_boxplot() +
labs(x = "Country", y = "Average Order Value (£)", title = "Average Order Value by Country")
```
```{r explore-data-2.8, echo=FALSE, warning=FALSE}
top5_countries_summaries %>%
ggplot(aes(x = Country, y = Transactions)) +
geom_boxplot() +
labs(x = "Country", y = "Transactions", title = "Number of Daily Transactions by Country")
```
#### Conclusions from Market Profitability Analysis
These simple analyses show that there are opportunities. Revenue in EIRE seems to be driven by 3 customers, who buy regularly and have a good average order value, but EIRE revenue has been declining recently. Given the small number of customers and high revenue, a bespoke email or promotion to these customers may drive loyalty and get them buying again.
The Netherlands has also been a significant source of revenue, but another one which has been declining over the last few months of the dataset. Further research (marketing campaign activity and web analytics data) may provide insight into why, but as customers in the Netherlands have shown a willingness to purchase in the past, the country represents a good opportunity to market to in order to build a loyal customer base.
France and Germany represent significant opportunities. Revenue from these countries has been rising, and the number of daily transactions is relatively strong. Marketing campaigns which aim to improve this while increasing average transaction values may be of significant benefit.
### Business Insights
```{r explore-data-2.9, echo=FALSE, warning=FALSE, results='asis'}
overall_info = initial_df %>%
summarise(NumberOfProducts = n_distinct(Description),
NumberOfTransactions = n_distinct(InvoiceNo),
NumberOfCustomers = n_distinct(CustomerID))
kable(overall_info, caption = "Business Statistics") %>% kable_styling()
```
It can be seen that the data contains 4,372 customers who have bought 3,885 different products, across a total of ~22,000 transactions.
Next, let's find out the number of products purchased in every transaction:
```{r explore-data-3.0, echo=FALSE, warning=FALSE, results='asis'}
customer_trans = initial_df %>%
group_by(CustomerID, InvoiceNo) %>%
summarise(NumberOfProducts = n())
kable(customer_trans[1:10, ],
caption = "Glimpse of number of items purchased per transactions") %>% kable_styling()
```
From the first few lines of the table above we can observe interesting shopping patterns:
- The existence of entries with the prefix 'C' for the InvoiceNo variable: this indicates transactions that have been cancelled.
- The existence of customers who only shopped once and purchased only one product
(e.g. CustomerID: 12346).
- The existence of frequent customers who buy a large number of items at each order.
**Who is our most profitable customer?**
```{r echo=FALSE, warning=FALSE, message=FALSE, results='asis'}
most_revenue = initial_df %>% group_by(CustomerID, Country) %>%
summarise(SalesRevenueContribution = sum(BasketPrice)) %>%
arrange(desc(SalesRevenueContribution))
kable(most_revenue[1:5, ],
caption = "Summary of contribution to sales revenue (£)") %>% kable_styling()
```
As we can see from the table above, **CustomerID: 14646** from the Netherlands contributes the most to our sales revenue, followed by **CustomerID: 18102** from the UK.
An interesting thing to note here is that we have 9 customers from the Netherlands who collectively contribute **£284,661.54** to our sales, of which **CustomerID: 14646 contributes ~98%**.
The same can be said of **CustomerID: 12414** from Australia, who contributes **~90%** of Australia's contribution to sales.
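We can sanity-check these shares directly from the transaction data. Below is a minimal sketch, using only columns already in our dataset, that computes the revenue share of each country's top customer; `top_n()` keeps the largest share within each country group:
```{r top-customer-share, eval=FALSE}
# Revenue share of the single biggest customer within each country
initial_df %>%
  group_by(Country, CustomerID) %>%
  summarise(Revenue = sum(BasketPrice)) %>%  # revenue per customer, grouped by country
  mutate(Share = Revenue / sum(Revenue)) %>% # each customer's share of the country total
  top_n(1, Share) %>%                        # keep each country's top customer
  arrange(desc(Revenue))
```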
As mentioned several times already, this shows the presence of wholesale buyers in our dataset. This is one more reason why it is important to carefully segment our customers, so that we can provide them with the information they will be most interested in consuming. The segmentation will further allow us, as an online retailer, to make better marketing decisions and design targeted interactions for our customers.
**Let's take a look at our inventory: What are the most popular and profitable products?**
```{r explore-data-3.1, echo=FALSE, warning=FALSE, message=FALSE}
products_sold = initial_df %>%
group_by(Description) %>%
summarise(NumSold = n())
products_sold$Description <- factor(products_sold$Description ,
levels = products_sold$Description [order(products_sold$NumSold)])
ps = products_sold %>% arrange(desc(NumSold)) %>% top_n(10) %>%
ggplot(aes(x = Description, y = NumSold)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Product", y = "Number of Products Sold", title = "Top 10 Most Popular Products") +
coord_flip()
products_sold_by_revenue = initial_df %>%
group_by(Description) %>%
summarise(Revenue = sum(BasketPrice))
products_sold_by_revenue$Description <- factor(products_sold_by_revenue$Description ,
levels = products_sold_by_revenue$Description
[order(products_sold_by_revenue$Revenue)])
psr = products_sold_by_revenue %>% arrange(desc(Revenue)) %>% top_n(10) %>%
ggplot(aes(x = Description, y = Revenue)) +
geom_bar(stat = "identity", fill = "maroon") +
labs(x = "Product", y = "Sales Revenue (£)",
title = "Top 10 Products by Revenue") +
coord_flip()
grid.arrange(ps, psr, nrow=2, ncol=1)
```
### Order Cancellations
```{r explore-data-3.3, echo=FALSE, warning=FALSE}
cancelled = customer_trans %>%
mutate(isCancelledOrder = ifelse(startsWith(InvoiceNo, "C"), 1, 0))
percent = round((sum(cancelled$isCancelledOrder) /
overall_info$NumberOfTransactions), 3) * 100
```
**Total number of orders cancelled: `r sum(cancelled$isCancelledOrder)`, i.e. `r percent`% of overall orders.**
```{r explore-data-3.4, echo=FALSE, warning=FALSE}
lost_rev = initial_df %>%
mutate(LostRevenue = ifelse(startsWith(InvoiceNo, "C"), BasketPrice, 0))
```
```{r echo=FALSE, warning=FALSE, results='asis'}
co = data.frame("Number of Cancelled Orders"=sum(cancelled$isCancelledOrder),
"Total Lost Revenue"=paste("£",abs(sum(lost_rev$LostRevenue))),
"Total Sales Revenue"=paste("£",sum(initial_df$BasketPrice)))
kable(co, caption = "Summary of Order Cancellations") %>% kable_styling()
```
### Insights on Product Description - Finding Keywords
In our dataset, products are uniquely identified by the <b>StockCode</b> variable, and a short description of each product is given in the <b>Description</b> variable. In this section, we attempt to group the products based on their descriptions. This information will help us cluster our customers later, and it provides crucial insights for targeted marketing.
We first create a corpus of the product descriptions and apply pre-processing techniques before converting it into a Document Term Matrix (DTM). An initial examination found **2166 keywords** in the DTM, the most frequent of which appeared in more than **200 products**. Some keywords, such as colours, carry little information.
For the final DTM these words have been discarded from the analysis, and words appearing **fewer than 20 times** have also been left out.
```{r product-category-1, echo=FALSE, warning=FALSE}
description = unique(initial_df$Description)
corpus = tm::Corpus(tm::VectorSource(description))
# Cleaning up
# Handling UTF-8 encoding problem from the dataset
corpus.cleaned = tm::tm_map(corpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
# Convert words to lower case
corpus.cleaned = tm::tm_map(corpus.cleaned, tolower)
# Removing stop-words
corpus.cleaned = tm::tm_map(corpus.cleaned, tm::removeWords, tm::stopwords('english'))
# Removing specific terms
corpus.cleaned = tm::tm_map(corpus.cleaned, tm::removeWords,
c("pink", "red", "blue", "tag", "white", "black", "green", "set"))
# Trimming excessive whitespaces
corpus.cleaned = tm::tm_map(corpus.cleaned, tm::stripWhitespace)
```
```{r product-category-1.1, echo=FALSE, warning=FALSE}
dtm = tm::DocumentTermMatrix(corpus.cleaned,
control=list(bounds = list(global = c(20,Inf))))
#inspect(t(dtm))
```
```{r product-category-1.2, echo=FALSE, warning=FALSE}
mat <- as.matrix(t(dtm))
freq_words <- sort(rowSums(mat), decreasing=TRUE)
```
Let's take a look at some keywords that appear multiple times in the product descriptions:
```{r product-category-1.3, echo=FALSE, warning=FALSE, message=FALSE}
df_keywords = as.data.frame(freq_words)
df_keywords[ "words" ] <- rownames(df_keywords)
rownames(df_keywords) <- NULL
df_keywords$words <- factor(df_keywords$words,
levels = df_keywords$words[order(df_keywords$freq_words)])
df_keywords %>% arrange(desc(freq_words)) %>% top_n(25) %>%
ggplot(aes(x = words, y = freq_words)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "keywords", y = "frequency ", title = "Keywords in Product Descriptions") +
coord_flip()
```
From the wordcloud below we can observe a range of different products in our inventory, i.e. **gifts** (keywords: christmas, decoration, flower, cake), or items for **home** (keywords: holder, mug, glass, bowl), or **jewelry** products (keywords: necklace, bracelet, silver, earrings).
```{r echo=FALSE, warning=FALSE, fig.align = "left", fig.height = 5.0, fig.width = 8}
set.seed(123)
wordcloud(df_keywords$words, df_keywords$freq_words,
colors = brewer.pal(8, "Dark2"),
min.freq = 2, random.order=FALSE, rot.per=0.20,
scale=c(5.0,0.25))
```
In order to derive actionable insights it is extremely valuable to have the product categories for our customer segmentation analysis. This will help us better understand what our customers like to buy, and help us come up with better promotional and marketing strategies.
# Scrape Product Categories
Since product categories can provide crucial information about what kind of products a customer usually purchases, we will scrape the data from a popular online shop that has the notion of a "product category": **Walmart**.
```{r image2, echo=FALSE, fig.cap="Product categories scraped from Walmart", out.width = '100%'}
knitr::include_graphics("walmart.png")
```
Assuming that manually entering a product category for each item would take ~15 seconds, and given the 3885 unique products in our dataset, scraping saved us roughly 3885 × 15 s ≈ 58,275 s, i.e. ~16 hours.
The 32 product categories were scraped and written to a file for future use. Below is a list of a few selected products and the categories we matched after scraping:
```{r scrape-product-categories-1, echo=FALSE, warning=FALSE, message=FALSE, results='asis'}
# des = des %>%
# mutate(ProductURL = gsub(" ", "%20", Description)) %>%
# mutate(ProductURL = paste("https://www.walmart.com/search/?query=",
# ProductURL, sep = ""))
#
# cl = makeCluster(detectCores()-1)
# registerDoParallel(cl)
# t1 = Sys.time()
# des$ProductCategory = foreach(i = seq_along(des$ProductURL),
# .packages = "rvest",
# .combine = "c",
# .errorhandling='pass') %dopar% {
# page = html_session(des$ProductURL[i]) %>%
# html_nodes(".dept-head-list-heading") %>% html_text()
# pc = "Other"
# if (length(page) > 0) {
# pc = page[[1]]
# }
# return(pc)
# }
# t_final = Sys.time()
# t_parallel = t_final - t1
# write_csv(des, "product_categories1.csv")
product_categories = read_csv("product_categories.csv")
pc = product_categories %>% dplyr::select(Description, ProductCategory)
pc = pc %>% filter(row_number() == 1 | row_number() == 3 |
row_number() == 5 | row_number() == 11 |
row_number() == 30 | row_number() == 35)
kable(pc, caption = "Matching descriptions with product categories scraped from Walmart") %>%
kable_styling()
```
Most products belonged to the **Home** category, followed by **Party & Occasions**. This corroborates our findings from the product description insights above.
The plot below helps us understand the retailer's inventory. Each product category on Walmart also has sub-categories that items can fall under; we have not considered those in our analysis, as the **parent categories** are good enough for now.
```{r scrape-product-categories-1.1, echo=FALSE, warning=FALSE, message=FALSE}
pc_plot = product_categories %>%
group_by(ProductCategory) %>%
summarise(Count = n()) %>% arrange(desc(Count))
pc_plot$ProductCategory <- factor(pc_plot$ProductCategory ,
levels = pc_plot$ProductCategory
[order(pc_plot$Count)])
pc_plot %>% top_n(10) %>% ggplot(aes(x = ProductCategory, y = Count)) +
geom_bar(stat = "identity", fill = "maroon") +
labs(x = "Product Category", y = "Count of Items Sold",
title = "Top 10 Product Categories") +
coord_flip()
```
With the product categories now available, we have a dataset we can use to extract customers' spending behavior, their products of interest and some basic information about their activity.
# Customer Segmentation
Let's use the customers' spending behavior, their products of interest and some basic information about their activity to perform the segmentation.
**Useful information for our analysis includes**:
- Average basket value
- Basket value range (min, max)
- Order frequency
- Tendency to cancel an order
- User’s activity (first and last purchase time)
- Products of interest
Let's group the data by customer and determine the **number of transactions** made by each of them, the **minimum, maximum and average amounts** spent, the **total amount** spent, days since **first purchase**, days since **last purchase**, and finally how much each **customer spends in each category**. This gives us our final dataset.
```{r customer-segmentation-1, echo=FALSE, warning=FALSE, message=FALSE, results='asis'}
last_date = max(initial_df$InvoiceDate)
customer_order_summary = initial_df %>%
group_by(CustomerID) %>%
summarise(n_baskets = n_distinct(InvoiceNo),
min_basket = min(BasketPrice),
avg_basket = mean(BasketPrice),
max_basket = max(BasketPrice),
total_basket = sum(BasketPrice),
first_purchase = min(InvoiceDate),
last_purchase = max(InvoiceDate)) %>%
mutate(first_purchase = as.integer(last_date - first_purchase),
last_purchase = as.integer(last_date - last_purchase))
temp_df = initial_df %>% left_join(product_categories)
customer_product_cat = temp_df %>%
spread(ProductCategory, BasketPrice, fill = 0, convert = TRUE) %>%
dplyr::select(-InvoiceNo, -StockCode, -Description, -Quantity, -InvoiceDate,
-Month, -Year, -InvoiceTime, -HourOfDay, -UnitPrice, -Country,
-DayOfWeek, -ProductURL) %>%
group_by(CustomerID) %>% summarise_all(.funs = sum)
customer_order_summary = customer_order_summary %>% left_join(customer_product_cat)
customer_order_summary$CustomerID = as.integer(customer_order_summary$CustomerID)
kable(customer_order_summary[1:5, c(1:8, 16:17,21, 38)],
caption = "Summary of customer purchase history") %>% kable_styling(font_size = 8)
```
Since we have 32 product categories, not all of them are shown in the table above.
### Statistical Clustering - K-means algorithm
The segmentation will be performed using K-means clustering, a simple and elegant way of partitioning customers into non-overlapping segments. This type of clustering has both advantages and disadvantages.
**Advantages:**
- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence.
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
**Disadvantages:**
- Choosing k manually: use a "loss vs. number of clusters" plot (or a diagnostic such as the average silhouette, used below) to find an optimal k.
- Dependence on initial values: for a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result (see the sketch below).
- Clustering data of varying sizes and density: k-means has trouble when clusters differ greatly in size and density; clustering such data requires generalizing k-means, as noted in the Advantages list above.
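As a minimal sketch of that mitigation (here `scaled_data` is a hypothetical placeholder for a pre-scaled numeric matrix, standing in for the scaled summary we build below), `kmeans()` handles the multiple restarts itself via `nstart`:
```{r kmeans-nstart-sketch, eval=FALSE}
# `scaled_data` is a hypothetical placeholder for a pre-scaled numeric matrix.
# nstart = 25 runs k-means from 25 random initialisations and keeps the fit
# with the lowest total within-cluster sum of squares.
set.seed(123)
km_once = kmeans(scaled_data, centers = 3, nstart = 1)
km_best = kmeans(scaled_data, centers = 3, nstart = 25)
c(km_once$tot.withinss, km_best$tot.withinss) # the multi-start fit is usually at least as tight
```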
<b>Importance of scaling the data before performing K-means: </b>
In our dataframe for customer segmentation described above, variables are measured in different units: a unit increase or decrease of one day in **first_purchase** or **last_purchase** is completely different from a unit increase or decrease of one pound in **total_basket**. Hence the importance of scaling the data, so that distances are comparable across variables. The data has been scaled using the function scale().
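As a tiny illustration of what `scale()` does, each column is centred to mean 0 and rescaled to unit standard deviation:
```{r scale-sketch, eval=FALSE}
x = c(10, 20, 30)
scale(x) # returns -1, 0, 1: mean(x) = 20 is subtracted, then we divide by sd(x) = 10
```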
<b>Choosing the optimal number of clusters: </b>
As noted above, the k-means algorithm doesn't choose the optimal number of clusters upfront, but there are different techniques to make the selection. The most popular are the within-cluster sum of squares, the average silhouette and the gap statistic. The silhouette statistic for a single element compares its mean within-cluster distance to its mean distance to the neighbouring cluster. It varies from -1 to 1, where high positive values mean the element is correctly assigned to its current cluster, while negative values mean it would be better assigned to the neighbouring one. Here we present the average silhouette across all data points:
```{r customer-segmentation-1.1, echo=FALSE, warning=FALSE}
# 1. Loading and preparing data
scaled_customer_order_summary = as.data.frame(scale(customer_order_summary))
# 2. Find the optimal number of clusters for k-means
fviz_nbclust(scaled_customer_order_summary, kmeans, method='silhouette')
```
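For reference, the average silhouette width that `fviz_nbclust()` plots can be computed directly with the *cluster* package. A minimal sketch for a single candidate k:
```{r silhouette-sketch, eval=FALSE}
# Average silhouette width for k = 3: for each point, compare its mean
# within-cluster distance to its mean distance to the nearest other cluster.
km = kmeans(scaled_customer_order_summary, centers = 3, nstart = 25)
sil = cluster::silhouette(km$cluster, dist(scaled_customer_order_summary))
mean(sil[, "sil_width"]) # higher is better; repeat over k = 2, 3, ... to reproduce the plot above
```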
As you can see above, the optimal number of clusters is clearly 3, so let's choose 3.
```{r customer-segmentation-1.2, echo=FALSE, warning=FALSE}
set.seed(123)
# 3. Compute k-means
km_model = kmeans(scaled_customer_order_summary, 3, nstart = 25)
customer_order_summary$Cluster = km_model$cluster
```
**Were the clusters extracted correctly? Let's verify.**
Our dataset stores 40 variables, so it's impossible to compare the assigned clusters across all of them (readable visualisations are restricted to a maximum of 3 dimensions).
One of the most popular approaches to this problem is **Principal Component Analysis (PCA)**. PCA combines the variables of a dataset to create new ones, called principal components, that capture most of the dataset's variation. Plotting the cluster distribution across the first principal components should show whether the clusters are separated or not.
```{r customer-segmentation-1.3, echo=FALSE, warning=FALSE}
pca <- PCA(scaled_customer_order_summary, graph = FALSE)
fviz_screeplot(pca, addlabels = TRUE, ylim = c(0, 50))
```
For this case, let’s plot how the clusters were distributed comparing the 1st vs. the 2nd, as well as the 1st vs. the 3rd PCA components.
```{r customer-segmentation-1.4, echo=FALSE, warning=FALSE}
fviz_cluster(km_model, data = scaled_customer_order_summary,
axes = c(1,2),
geom = "point",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal(),
main = "Partitioning Clustering Plot Dim1 vs. Dim2")
fviz_cluster(km_model, data = scaled_customer_order_summary,
axes = c(1,3),
geom = "point",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal(),
main = "Partitioning Clustering Plot Dim1 vs. Dim3")
```
From the plots above we can conclude that all three clusters are well separated, with essentially no overlap. We're happy with this result and can now move on to the next part of the analysis.
<b> How can we detect which of the 40 variables distinguish our customers? </b>
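One quick way to probe this is to inspect which variables contribute most to the principal components along which the clusters separate; a minimal sketch using the `pca` object fitted above:
```{r pca-contrib-sketch, eval=FALSE}
# Top variable contributions to the first two principal components -
# the directions along which the clusters separate in the plots above
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
fviz_contrib(pca, choice = "var", axes = 2, top = 10)
```
A more business-friendly answer, though, comes from RFM analysis: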
### RFM Analysis: (Recency, Frequency, Monetary)
Recency, frequency, monetary value (RFM) is a marketing analysis tool used to identify a company's or an organization's best customers. The RFM model is based on three quantitative factors:
- Recency: How recently a customer has made a purchase
- Frequency: How often a customer makes a purchase
- Monetary Value: How much money a customer spends on purchases
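These three factors map directly onto columns we have already computed; a minimal sketch of the per-customer RFM view:
```{r rfm-sketch, eval=FALSE}
# Per-customer RFM table from the existing summary:
# recency = days since last purchase, frequency = number of baskets,
# monetary = total spend across all baskets
rfm = customer_order_summary %>%
  transmute(CustomerID,
            Recency = last_purchase,
            Frequency = n_baskets,
            Monetary = total_basket)
head(rfm)
```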
Below is a summary table that explains the differences in the three clusters.
```{r customer-segmentation-1.5, echo=FALSE, warning=FALSE, results='asis'}
cluster_diff = customer_order_summary %>% group_by(Cluster) %>%
summarise('Number of Customers' = n(),
'Recency Mean' = round(mean(last_purchase)),
'Frequency Mean' = scales::comma(round(mean(n_baskets))),
'Monetary Value Mean' = scales::comma(round(mean(total_basket))),
'Cluster Revenue' = scales::comma(sum(total_basket)))
kable(cluster_diff, caption = "Difference between the three clusters") %>% kable_styling()
```
In general, it's necessary to analyse the distribution of each variable grouped by the assigned cluster, and boxplots work well for this. Below we present box plots of Recency, Frequency and Monetary value in each of the three clusters.
```{r customer-segmentation-1.6, echo=FALSE, warning=FALSE, fig.height = 8, fig.width = 8}
customer_order_summary$Cluster = as.factor(customer_order_summary$Cluster)
r = customer_order_summary %>%
ggplot(aes(x = Cluster, y = last_purchase, fill = Cluster)) +
geom_boxplot(fill = c("steelblue1", "gold3", "orangered3")) +
labs(x = "Cluster", y = "Number of Days",
title = "Recency: Distribution of Days since Last Order") +
scale_fill_brewer(palette="RdBu") + theme_minimal()
f = customer_order_summary %>%
ggplot(aes(x = Cluster, y = n_baskets, fill = Cluster)) +
geom_boxplot(fill = c("steelblue1", "gold3", "orangered3")) +
labs(x = "Cluster", y = "Number of Transactions",
title = "Frequency: Distribution of Transactions") +
scale_fill_brewer(palette="RdBu") + theme_minimal()
m = customer_order_summary %>%
ggplot(aes(x = Cluster, y = total_basket, fill = Cluster)) +
geom_boxplot(fill = c("steelblue1", "gold3", "orangered3")) +
labs(x = "Cluster", y = "Order Value (£)",
title = "Monetary: Distribution of Order Value") +
scale_fill_brewer(palette="RdBu") + theme_minimal()
grid.arrange(r, f, m, nrow = 3)
```
From the above summary we can detect a few simple characteristics about customers in each cluster.
<b> Cluster 1 (Blue): </b>
- Tend to spend a lot of money in total, £241,083 on average.
- They order in bulk.
- Products of interest for the group are varied.
- These clients are, on average, also the most active in the recent past.
- We can classify the group as high-value customers (wholesalers).
<b> Cluster 2 (Gold): </b>
- Tend to spend a moderate amount of money in total, £57,323 on average.
- They order the highest number of baskets on average.
- They spend moderately across each product category.
- These clients order weekly on average.
- We can classify the group as regular customers.
<b> Cluster 3 (Red): </b>
- Tend to spend a small amount of money in total, £1,448 on average.
- The number of transactions is extremely low, 5 baskets on average.
- These clients are, on average, the least active in the recent past, taking months before their next purchase.
- We can classify the group as typical bargain hunters (non-wholesalers).
### Products of Interest within each Cluster
Next, let's analyze the tendency of each of the three clusters to buy products in a specific category.
The categories **Home** and **Party & Occasions** have been left out, as they generate the most revenue for each of the three clusters. We'll focus on a few other categories that boost sales within and across the three clusters.
```{r customer-segmentation-1.7, echo=FALSE, warning=FALSE}
product_stats_cluster = customer_order_summary %>%
dplyr::select(-CustomerID, -n_baskets, -min_basket, -avg_basket,
-max_basket, -total_basket, -first_purchase, -last_purchase)
product_stats_cluster =
product_stats_cluster %>% gather(key = "ProductCategory", value = "BasketValue", -Cluster)
product_stats_cluster %>%
filter(ProductCategory %in% c("Arts Crafts & Sewing", "Jewelry",
"Clothing", "Office Supplies",
"Toys", "Pets", "Food",
"Patio & Garden")) %>%
ggplot(aes(x = ProductCategory, BasketValue)) +
stat_summary(fun=sum, geom="bar", fill="#CC6666", colour="black") +
labs(x = "Product Category", y = "Sales Revenue (£)",
title = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~Cluster, scales = "free")
```
From the bar plots above we can summarize the tendency to buy in a specific category.
<b> Cluster 1: </b>
- Customers spend a lot on Office Supplies, followed by Patio & Garden and Clothing.
- They spend less on Jewelry and Food products.
- Spend the most on products for Pets across the three clusters.
<b> Cluster 2: </b>
- Spend most on Toys, followed by Clothing and Arts Crafts & Sewing.
- Least sales revenue from products for Pets.
<b> Cluster 3: </b>
- Spend most on Clothing, followed by Toys and Office Supplies.
- Least sales revenue from products for Pets.
# Further Segmentation - Hierarchical Clustering
To enhance the clustering analysis, we decided to further segment the largest cluster of customers from the first segmentation (Cluster 3). This sub-segmentation was performed with a hierarchical, tree-based partitioning (a regression tree) to further understand the characteristics of this group.
Monetary value was selected as the target for the further segmentation, with frequency and recency as its predictors.
```{r customer-segmentation-1.8, echo=FALSE, warning=FALSE}
tree_cluster3 = customer_order_summary %>%
filter(Cluster == '3') %>%
dplyr::select(n_baskets, total_basket, last_purchase)
fit_tree = rpart(total_basket ~ .,
data = tree_cluster3,
method = 'anova',
control = rpart.control(cp=0.0127102))
rpart.plot(fit_tree, type=1,extra=1, box.palette=c("gray","lightblue"))
```
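As a cross-check on this tree-based sub-segmentation, a classic agglomerative hierarchical clustering on the same three variables could be sketched as follows (an assumption on our part: Ward linkage, cut at 7 groups to mirror the tree):
```{r hclust-sketch, eval=FALSE}
# Agglomerative hierarchical clustering of Cluster 3 customers on the same
# three variables used by the tree, scaled so no variable dominates the distances
d = dist(scale(tree_cluster3))
hc = hclust(d, method = "ward.D2")
sub_clusters = cutree(hc, k = 7) # cut into 7 sub-segments, mirroring the tree
table(sub_clusters)              # sub-segment sizes
```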
This sub-segmentation divided Cluster 3 into 7 smaller clusters.
<b> Results: (From low value to high value customers) </b>
- 2,130 customers who purchase fewer than 3 times, with an average monetary value of £412.
- 1,156 customers who purchase more than 3 but fewer than 6 times, with an average monetary value of £1,166 (significantly higher than the previous group).
- 550 customers who purchase more than 6 but fewer than 10 times, with an average monetary value of £2,159.
- 323 customers who purchase more than 10 but fewer than 17 times, with an average monetary value of £3,738.
- 167 customers who purchase more than 17 but fewer than 41 times, with an average monetary value of £7,532.
- 12 customers who purchase more than 41 but fewer than 52 times, with an average monetary value of £15,000.
- 10 customers who purchase more than 52 times, with an average monetary value of £24,000.
This last sub-segment of 10 customers represents the most valuable customers within Cluster 3. From these insights, the executive and management teams can take strategic actions to increase the average monetary value of the lower sub-segments within this cluster.
# How can we use the results for targeted marketing?
We were able to group our customers based on their purchase behaviour and we managed to detect meaningful factors for each group. The best way forward is to prepare specific interactions for each one.
Here are some ideas:
1. For **Cluster 1**, the high-value customers may be entrepreneurs ordering wholesale quantities of products. We can prepare an offer for them to get an extra discount when they buy in bulk, and target advertising for pet products.
2. As for regular customers in **Cluster 2**, they might be encouraged to return if we inform them about new and/or unique products from our line. We could even include recommendations from the appropriate influencers. Let them know about discounts on Toys and Clothing or about new products in these categories.
3. We can target the top 22 customers obtained from further segmenting Cluster 3 using similar strategies as for other wholesalers and customers who buy in bulk. For other customers in **Cluster 3**, we can offer selected promotions for products from their categories of interest. We could periodically send the discount offers by email or show the message right after the user logs in to our website.
# Summary
To sum up, by answering a few questions about the data and applying popular clustering methods we managed to get interesting information about our clients.
We analyzed sales trends, market profitability, order cancellations and product categories through exploratory data analysis. We then scraped product categories from Walmart's website and created a dataframe summarizing customers' purchase history, which we used to perform customer segmentation. The clusters obtained by implementing the k-means and hierarchical clustering algorithms helped us segment our customers to gain insight into their shopping behaviors, analyze product affinity for each cluster, measure marketing effectiveness, and better allocate future marketing spend.