---
date: "`r (lubridate::ymd('20241002') + lubridate::dweeks(3))`"
title: "Class 7 Unsupervised Learning and K-Means Clustering"
---
# Overview of Predictive Analytics
```{r}
#| echo: false
pacman::p_load(dplyr, ggplot2, ggthemes)
```
## Roadmap for Predictive Analytics
- The core of any business decision is **profitability analysis** (BEQ, NPV, CLV). To increase firm profitability, a firm can:
    (1) Increase revenue
    (2) Reduce costs (CAC or variable marketing costs)
    (3) Reduce customer churn
- In Weeks 4 and 5, we will learn how to utilize **predictive analytics** to improve profitability. Correspondingly, we will:
    (1) Develop customers through ML recommender systems
    (2) **Reduce costs by targeting more responsive customers**
    (3) Predict customer churn and take preventive actions
```{r}
#| echo: false
#| fig-align: 'center'
knitr::include_graphics("images/CourseStructure2024.png")
```
## Types of Predictive Analytics
- Unsupervised Learning
    - Only observe X => want to uncover unknown subgroups
- Supervised Learning
    - Observe both X and Y => want to predict Y for new data
In Term 2, you will learn predictive analytics models systematically. When you do, think about how those techniques can be applied back to these case studies.
## Types of Predictive Analytics
```{r}
#| echo: false
#| fig-align: 'center'
knitr::include_graphics("images/Week 4/PredictiveAnalyticsTypes.png")
```
## Learning Objectives for Today
- Understand the concept of unsupervised learning
- Understand how to apply K-means clustering and find the optimal number of clusters
- Understand how to apply clustering analysis for customer segmentation at M&S
# K-Means Clustering
## K-Means Clustering
- K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of *k* groups (i.e. *k* clusters), where *k* represents the number of groups pre-specified by the analyst.
- For data scientists: It can classify customers into multiple segments (i.e., clusters), such that customers within the same cluster are as **similar** as possible, whereas customers from different clusters are as **dissimilar** as possible.
- Input: (1) customer characteristics; (2) the number of clusters
- Output: cluster membership of each customer (see the sketch below)
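A minimal sketch of these inputs and outputs, using R's built-in `kmeans()` on simulated data (the customer values below are made up for illustration):
```{r}
# Input (1): customer characteristics (simulated for illustration)
set.seed(123)
customers <- data.frame(
  income   = c(rnorm(50, mean = 30, sd = 5), rnorm(50, mean = 80, sd = 5)),
  spending = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 40, sd = 2))
)
# Input (2): the number of clusters
k <- 2
# Output: the cluster membership of each customer
fit <- kmeans(customers, centers = k)
head(fit$cluster)
```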
## Similarity and Dissimilarity
- The clustering of observations into groups requires computing the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix.
- The choice of similarity measures is a critical step in clustering.
- The most common distance measures are the Euclidean distance (the default for K-means) and the Manhattan distance.
## Euclidean Distance
- The most common distance measure is the Euclidean distance.
$$
d_{\text{euc}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
- Example of Income and Spending for 3 customers
    - $Income = (5, 10, 20)$
    - $Spending = (3, 4, 12)$
- Euclidean distance
    - $d_{\text{euc}}(2, 1) = \sqrt{(10-5)^2 + (4-3)^2} = \sqrt{25 + 1} = \sqrt{26}$
    - $d_{\text{euc}}(2, 3) = \sqrt{(10-20)^2 + (4-12)^2} = \sqrt{100 + 64} = \sqrt{164}$
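We can verify these two distances with base R's `dist()` function, which computes all pairwise Euclidean distances by default:
```{r}
customers <- data.frame(Income = c(5, 10, 20), Spending = c(3, 4, 12))
# Pairwise Euclidean distances; compare sqrt(26) ~ 5.10 and sqrt(164) ~ 12.81
dist(customers)
```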
::: {.content-visible when-format='beamer'}
## Visualization of Euclidean Distance
:::
::: {.columns}
::: {.column width='50%'}
```{r}
#| out.width: "100%"
#| warning: false
Income <- c(5, 10, 20)
Spending <- c(3, 4, 12)
data <- data.frame(Income, Spending, ID = c("Customer 1", "Customer 2", "Customer 3"))
ggplot(data, aes(x = Income, y = Spending)) +
  geom_point(aes(shape = ID, color = ID), size = 2.5) +
  geom_text(aes(label = rownames(data)), vjust = -0.5) +
  theme_stata()
```
:::
::: {.column width='50%'}
```{r}
#| out.width: "100%"
#| warning: false
Income <- c(5, 10, 20)
Spending <- c(3, 4, 12)
data <- data.frame(Income, Spending, ID = c("Customer 1", "Customer 2", "Customer 3"))
ggplot(data, aes(x = Income, y = Spending)) +
  geom_point(aes(shape = ID, color = ID), size = 2.5) +
  geom_text(aes(label = rownames(data)), vjust = -0.5) +
  theme_stata() +
  # show the Euclidean distance between Customer 1 and Customer 2,
  # along with its vertical and horizontal components
  geom_segment(aes(x = 5, y = 3, xend = 10, yend = 4), linetype = "dashed") +
  geom_segment(aes(x = 5, y = 3, xend = 5, yend = 4), linetype = "dashed") +
  geom_segment(aes(x = 5, y = 4, xend = 10, yend = 4), linetype = "dashed") +
  # show the Euclidean distance between Customer 2 and Customer 3,
  # along with its vertical and horizontal components
  geom_segment(aes(x = 10, y = 4, xend = 20, yend = 12), linetype = "dashed") +
  geom_segment(aes(x = 10, y = 4, xend = 10, yend = 12), linetype = "dashed") +
  geom_segment(aes(x = 10, y = 12, xend = 20, yend = 12), linetype = "dashed")
```
:::
:::
## Manhattan Distance
- Another distance measure is the Manhattan distance; it is less commonly used because the absolute value function is not differentiable.
$$
d_{\text{man}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|
$$
- Example of Income and Spending for 3 customers
    - $Income = (5, 10, 20)$
    - $Spending = (3, 4, 12)$
- Manhattan distance
    - $d_{\text{man}}(2, 1) = |10-5| + |4-3| = 5 + 1 = 6$
    - $d_{\text{man}}(2, 3) = |10-20| + |4-12| = 10 + 8 = 18$
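The same check in R, switching `dist()` to the Manhattan metric:
```{r}
customers <- data.frame(Income = c(5, 10, 20), Spending = c(3, 4, 12))
# method = "manhattan" sums the absolute coordinate differences
dist(customers, method = "manhattan")
```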
## K-Means Clustering: Step 1
::: columns
::: {.column width="40%"}
![](images/Week 4/kmeans_1.png){fig-align="left"}
:::
::: {.column width="60%"}
- Raw data points; each dot is a customer
- X and Y axis are customer characteristics, say, income and spending
- Visually, there appear to be 2 segments
- Let's see how K-means classifies customers into 2 segments in a data-driven way
:::
:::
## K-Means Clustering: Step 2
::: columns
::: {.column width="40%"}
![](images/Week 4/kmeans_2.png){fig-align="left"}
:::
::: {.column width="60%"}
- We specify 2 segments
- K-means initializes the process by **randomly** selecting 2 centroids
:::
:::
Due to this randomness, different starting points may yield different results, so we rerun the algorithm from multiple random initializations to ensure the **robustness** of the final clustering.
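In R's `kmeans()`, the `nstart` argument automates this reinitialization: the algorithm is run from `nstart` random starting points, and the solution with the lowest total within-cluster sum of squares is kept. A minimal sketch on simulated data:
```{r}
set.seed(456)
x <- data.frame(income = rnorm(100), spending = rnorm(100))
# Run K-means from 25 random initializations and keep the best solution
fit <- kmeans(x, centers = 2, nstart = 25)
fit$tot.withinss
```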
## K-Means Clustering: Step 3
::: columns
::: {.column width="40%"}
![](images/Week 4/kmeans_3.png){fig-align="left"}
:::
::: {.column width="60%"}
- K-means computes the distance of each customer to the red and blue centroids
- K-means assigns each customer to the red or blue segment based on which centroid is closer
:::
:::
## K-Means Clustering: Step 4
::: columns
::: {.column width="40%"}
![](images/Week 4/kmeans_4.png){fig-align="left"}
:::
::: {.column width="60%"}
- K-means updates the centroid of each segment (the mean of its assigned customers)
- The red cross and blue cross in the picture are the new centroids
- We still see some "outliers", so the algorithm continues
:::
:::
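Steps 3 and 4 are the core of the algorithm. Below is a minimal hand-coded sketch of one assign-then-update iteration on the three-customer example, with hypothetical starting centroids (in practice, `kmeans()` performs these steps internally):
```{r}
customers <- data.frame(Income = c(5, 10, 20), Spending = c(3, 4, 12))
# Two hypothetical starting centroids
centroids <- data.frame(Income = c(5, 20), Spending = c(3, 12))
# Step 3: assign each customer to the nearest centroid (Euclidean distance)
d <- sapply(seq_len(nrow(centroids)), function(k) {
  sqrt((customers$Income - centroids$Income[k])^2 +
       (customers$Spending - centroids$Spending[k])^2)
})
cluster <- apply(d, 1, which.min)
cluster
# Step 4: update each centroid to the mean of its assigned customers
aggregate(customers, by = list(cluster = cluster), FUN = mean)
```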
## K-Means Clustering: Step 5
::: columns
::: {.column width="40%"}
![](images/Week 4/kmeans_5.png){fig-align="left"}
:::
::: {.column width="60%"}
- K-means computes the distance of each customer to the red and blue centroids
- K-means reassigns each customer to the red or blue segment based on which centroid is closer
- Now the outliers are correctly assigned to their segments
:::
:::
## K-Means Clustering: Step 6
::: columns
::: {.column width="50%"}
![](images/Week 4/kmeans_6.png){fig-align="left"}
:::
::: {.column width="60%"}
- K-means updates the centroids based on the previous cluster assignment
- K-means computes the distance of each customer to the new centroids
- K-means finds that all customers are correctly assigned to their nearest centroids, so the algorithm does not need to continue
- We say that the algorithm **converges**, and it stops
:::
:::
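Putting the six steps together: `kmeans()` alternates assignment and centroid updates until no customer changes segment. A minimal end-to-end sketch on simulated two-segment data (values made up for illustration):
```{r}
set.seed(789)
segments <- rbind(
  data.frame(income = rnorm(50, mean = 30, sd = 5), spending = rnorm(50, mean = 10, sd = 2)),
  data.frame(income = rnorm(50, mean = 80, sd = 5), spending = rnorm(50, mean = 40, sd = 2))
)
fit <- kmeans(segments, centers = 2, nstart = 25)
fit$iter            # number of assign-update iterations until convergence
fit$centers         # final centroids of the two segments
table(fit$cluster)  # segment sizes
```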
## After-Class Readings
- More technical details: [K-means Cluster Analysis](https://uc-r.github.io/kmeans_clustering)