-
Notifications
You must be signed in to change notification settings - Fork 35
/
R-机器学习 曲线分离.Rmd
445 lines (343 loc) · 16.3 KB
/
R-机器学习 曲线分离.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
---
title: "机器学习之曲线分离"
author: "李誉辉"
date: "2020/12/14"
output:
html_document:
highlight: pygments
theme: cerulean
toc: yes
toc_depth: 4
toc_float:
collapsed: true
smooth_scroll: false
code_folding: hide
number_sections: true
df_print: paged
---
# 引言
  在现实环境中,我们有许多时间序列数据,比如超市零售数据,可能同时包括成千上万种商品的销量历史数据。
不同商品销量随时间会有不同的变化。如果需要对销量进行预测,首先就应该根据趋势,对商品进行分类。
然后对同一类的商品建立相同的模型,当然模型的参数略微有些不同。<br/>
  似乎很少见到根据曲线趋势进行分类的例子,如何评价曲线的趋势?这是一个问题。
我们首先应该想到,相关系数可以评价2个曲线之间的相关程度。
同一类曲线之间,应该具有较高的相关系数。因此,我们可以先计算曲线之间的相关系数,
构成相关矩阵,然后转换为数据框,再进行聚类分析。考虑到上千条曲线,产生的矩阵维度非常高,
因此在聚类分析之前,还应该进行PCA降维。<br/>
  相关系数具有正负性,在计算相关系数之前,最好先将上升和下降趋势的曲线分离,
然后分别计算相关系数和聚类分析。毕竟对于上升和下降的时间序列,模型差异会非常大。
如何分离上升和下降趋势的曲线?最简单的方法,就是建立一阶线性回归模型,
然后用`coef(model)[2]`提取拟合曲线的斜率,如果斜率$>0$,则为上升趋势,$<0$则为下降趋势。<br/>
  考虑到不同商品的单价差异巨大,在一张图上绘制所有曲线,不便于观察,首先就需要使用`scale`进行放缩。<br/>
# 需要的包
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
rm(list = ls()); gc() # 清空内存
setwd("D:/R/working_documents1") # 工作目录
library(dplyr)
library(magrittr)
library(purrr)
library(tibble)
library(tidyr)
library(ggplot2)
library(readr)
library(data.table) # big data process
library(lubridate) # date data process
library(echarts4r)
library(psych) # PCA
library(ggdendro) #
library(dendextend)
library(ape)
library(patchwork)
library(modelr)
```
# 编造数据
  为了方便演示,这里使用编造数据,因为真实的数据异常情况太多,每一步都需要筛选等预处理。<br/>
<p style="color:red; font-size:200%; font-weight:bold">第一组曲线:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.width=8}
curves_DT <-
data.table(DATE = seq(from = as.Date("2017-01-01"),
to = as.Date("2020-12-31"),
by = "month"))
curves_DT[, x := 1:.N]
# 第1条曲线
curves_DT[, sales1 := sin(pi/6*x) + 0.005*x^2 + 0.1*x + 50 # 周期为12个月
][, sales1 := sales1 + runif(length(sales1))/100*sales1] # 增加随机扰动
# 第2条曲线
curves_DT[, sales2 := sin(pi/6*x) + 0.2*x + 60
][, sales2 := sales2 + runif(length(sales2))/100*sales2] # 增加随机扰动
# 第3条曲线
curves_DT[, sales3 := sin(pi/6*x)/2 + sqrt(x+5) + 70
][, sales3 := sales3 + runif(length(sales3))/100*sales3] # 增加随机扰动
# 第4条曲线
curves_DT[, sales4 := sin(pi/6*x)/2 + 0.5*log(x^2) + 70
][, sales4 := sales4 + runif(length(sales4))/100*sales4] # 增加随机扰动
# 画图
par(mfrow = c(2, 2))
curves_DT[, plot(DATE, sales1, type = "l", col = "steelblue", lwd = 2,
main = "curve1")]
curves_DT[, plot(DATE, sales2, type = "l", col = "steelblue", lwd = 2,
main = "curve2")]
curves_DT[, plot(DATE, sales3, type = "l", col = "steelblue", lwd = 2,
main = "curve3")]
curves_DT[, plot(DATE, sales4, type = "l", col = "steelblue", lwd = 2,
main = "curve4")]
```
<p style="color:red; font-size:200%; font-weight:bold">第二组曲线:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.width=8}
# 第5条曲线
curves_DT[, sales5 := sin(pi/6*x)/4 + 2*log(x) + 60
][, sales5 := sales5 + runif(length(sales5))/100*sales5] # 增加随机扰动
# 第6条曲线
curves_DT[, sales6 := sqrt(abs(sin(pi/6*x))) + 2*log(x) + 70
][, sales6 := sales6 + runif(length(sales6))/100*sales6] # 增加随机扰动
# 第7条曲线
curves_DT[, sales7 := sin(pi/6*x)/4 + 5*sqrt(log(x+2)) + 70
][, sales7 := sales7 + runif(length(sales7))/100*sales7] # 增加随机扰动
# 第8条曲线
curves_DT[, sales8 := sin(pi/6*x)/6 + 3*log(x+3) + 80
][, sales8 := sales8 + runif(length(sales8))/100*sales8] # 增加随机扰动
# 画图
par(mfrow = c(2, 2))
curves_DT[, plot(DATE, sales5, type = "l", col = "steelblue", lwd = 2,
main = "curve5")]
curves_DT[, plot(DATE, sales6, type = "l", col = "steelblue", lwd = 2,
main = "curve6")]
curves_DT[, plot(DATE, sales7, type = "l", col = "steelblue", lwd = 2,
main = "curve7")]
curves_DT[, plot(DATE, sales8, type = "l", col = "steelblue", lwd = 2,
main = "curve8")]
```
<p style="color:red; font-size:200%; font-weight:bold">第三组曲线:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.width=8}
# 第9条曲线
curves_DT[, sales9 := 0.005*(x-24)^3 + 80
][, sales9 := sales9 + runif(length(sales9))/100*sales9] # 增加随机扰动
# 第10条曲线
curves_DT[, sales10 := 0.005*(x-24)^3 + 2*sin(pi/6*x) + 90
][, sales10 := sales10 + runif(length(sales10))/100*sales10] # 增加随机扰动
# 第11条曲线
curves_DT[, sales11 := 0.005*(x-24)^3 + 5*sin(pi/6*x)^2 + 100
][, sales11 := sales11 + runif(length(sales11))/100*sales11] # 增加随机扰动
# 第12条曲线
curves_DT[, sales12 := 0.005*(x-24)^3 + 5*cos(pi/6*x)^2 + 0.5*x + 90
][, sales12 := sales12 + runif(length(sales12))/100*sales12] # 增加随机扰动
# 画图
par(mfrow = c(2, 2))
curves_DT[, plot(DATE, sales9, type = "l", col = "steelblue", lwd = 2,
main = "curve9")]
curves_DT[, plot(DATE, sales10, type = "l", col = "steelblue", lwd = 2,
main = "curve10")]
curves_DT[, plot(DATE, sales11, type = "l", col = "steelblue", lwd = 2,
main = "curve11")]
curves_DT[, plot(DATE, sales12, type = "l", col = "steelblue", lwd = 2,
main = "curve12")]
```
<p style="color:red; font-size:200%; font-weight:bold">第四组曲线:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.width=8}
# 第13条曲线
curves_DT[, sales13 := 0.005*(24-x)^3 + 80
][, sales13 := sales13 + runif(length(sales13))/100*sales13] # 增加随机扰动
# 第14条曲线
curves_DT[, sales14 := 0.005*(24-x)^3 + 2*sin(pi/6*x) + 90
][, sales14 := sales14 + runif(length(sales14))/100*sales14] # 增加随机扰动
# 第15条曲线
curves_DT[, sales15 := 0.005*(24-x)^3 + 5*sin(pi/6*x)^2 + 100
][, sales15 := sales15 + runif(length(sales15))/100*sales15] # 增加随机扰动
# 第16条曲线
curves_DT[, sales16 := 0.005*(24-x)^3 + 5*cos(pi/6*x)^2 + 0.5*x + 90
][, sales16 := sales16 + runif(length(sales16))/100*sales16] # 增加随机扰动
# 画图
par(mfrow = c(2, 2))
curves_DT[, plot(DATE, sales13, type = "l", col = "steelblue", lwd = 2,
main = "curve13")]
curves_DT[, plot(DATE, sales14, type = "l", col = "steelblue", lwd = 2,
main = "curve14")]
curves_DT[, plot(DATE, sales15, type = "l", col = "steelblue", lwd = 2,
main = "curve15")]
curves_DT[, plot(DATE, sales16, type = "l", col = "steelblue", lwd = 2,
main = "curve16")]
```
# 数据缩放
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
curves_DT2 <-
curves_DT %>%
data.table::melt(id.vars = c("DATE", "x"),
variable.name = "curves",
value.name = "sales",
variable.factor = FALSE,
value.factor = FALSE) %>%
.[, !"x"] %>% .[, sales_scale := scale(sales),
by = .(curves)]
curves_DT2 %>%
as.data.frame() %>%
group_by(curves) %>%
e_charts(DATE) %>%
e_line(sales) %>%
e_title("产品销量走势") %>%
e_theme("chalk") %>%
e_tooltip(trigger = "axis")
```
# 相关系数
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
cor_df <- curves_DT[, !c("x", "DATE")] %>% as.data.frame()
rownames(cor_df) <- curves_DT$DATE
cor_df[1:6, 1:6]
cor_Mat <- cor_df %>%
as.matrix() %>% # 转换为矩阵
scale() %>% # 缩放
cor(method = "pearson", use = "pairwise.complete.obs") # 计算相关性, use避免NA报错
dim(cor_Mat)
is.na(cor_Mat) %>% sum() %>% `/`(., 2014^2) # 计算NA的比例
cor_Mat[1:6, 1:6]
```
# 聚类
  由于NA太多,在聚类之前,我们不妨进行主成分降维。降维之前,需要对`NA`进行处理,
这里不妨将相关系数绝对值$<0.8$的都转为`NA`, 然后将所有`NA`转为`0.1`。
因为缺失值和0太多会导致PCA和计算距离过程中报错。 <br/>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
cor_Mat2 <- ifelse(is.na(cor_Mat), 0.1, cor_Mat)
cor_Mat2 <- ifelse(cor_Mat2 %between% c(0, 0.8), 0.1, cor_Mat2)
cor_Mat2 <- ifelse(cor_Mat2 %between% c(-0.8, 0), -0.1, cor_Mat2)
cor_Mat2 <- matrix(cor_Mat2, ncol = ncol(cor_Mat))
cor_df2 <- cor_Mat2 %>% as.data.frame()
rownames(cor_df2) <- rownames(cor_Mat)
cor_df2[1:6, 1:6]
```
<p style="color:red; font-size:200%; font-weight:bold">PCA降维:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
start_time <- Sys.time()
pca <- principal(cor_df2[, -1], rotate = "none")
Sys.time() - start_time
plot(pca$values[1:10], type = "b",
ylab = "特征值", xlab = "主成分数量", col = "red")
```
  从上图可以看出,当主成分数量为3时,特征值趋于平稳。<br/>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
pca3 <- pca(cor_df2, nfactors = 3, rotate = "none")
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
pca3.result <- pca3$loadings %>%
as.vector() %>%
matrix(ncol = 3) %>%
as.data.frame()
colnames(pca3.result) <- paste0("PC", 1:3)
rownames(pca3.result) <- rownames(cor_df2)
head(pca3.result)
```
<p style="color:red; font-size:200%; font-weight:bold">层次聚类:</p>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.width=8}
hc <-
pca3.result %>%
dist() %>% # 计算distance矩阵,
hclust(method = "complete") # 层次聚类
hc %>%
as.dendrogram() %>%
plot()
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
myClusters <- hc %>%
as.dendrogram() %>%
cutree(3)
myClusters <- data.table(sales = names(myClusters),
tree = myClusters)
head(myClusters)
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
myTrees <- myClusters %>% split(by = "tree")
str(myTrees)
```
# 可视化结果
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.height=8}
p0 <- curves_DT2 %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("All Curves")
p1 <- curves_DT2[curves %chin% myTrees[[1]]$sales] %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster1")
p2 <- curves_DT2[curves %chin% myTrees[[2]]$sales] %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster2")
p3 <- curves_DT2[curves %chin% myTrees[[3]]$sales] %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster3")
(p0 + p1)/(p2 + p3) # 拼图
```
  看起来聚类效果还不错,只是cluster1中没有分离波动曲线与较平滑曲线。
如果要对这2类曲线进行分离,可以使用3次多项式拟合,然后根据残差绝对值和大小来区分,残差绝对值和大的属于波动剧烈的,
小的属于较平滑的曲线。<br/>
# 波动性分离
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
curve_model <- function(df){
lm(sales_scale ~ poly(DATE, 3), data = df)
}
fluctuate_df <- curves_DT2[curves %chin% myTrees[[1]]$sales] %>%
as.data.frame() %>%
group_nest(curves) %>% # 按曲线分组
mutate(model = map(data, curve_model)) %>% # 拟合模型
mutate(resids = map2(data, model, add_residuals)) %>% # 计算残差
mutate(resid_sum = map_dbl(resids, ~sum(abs(.x$resid)))) # 残差绝对值和
range(fluctuate_df$resid_sum)
head(fluctuate_df)
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
Cluster1 <- fluctuate_df %>% filter(resid_sum < 4) %>% # 分离残差和小的曲线
dplyr::select(curves, data) %>%
unnest(data)
Cluster2 <- fluctuate_df %>% filter(resid_sum >= 4) %>% # 分离残差和大的曲线
dplyr::select(curves, data) %>%
unnest(data)
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold", fig.width=12, fig.height=8}
p1 <- Cluster1 %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster1")
p2 <- Cluster2 %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster2")
p3 <- curves_DT2[curves %chin% myTrees[[2]]$sales] %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster3")
p4 <- curves_DT2[curves %chin% myTrees[[3]]$sales] %>%
as.data.frame() %>%
ggplot() +
geom_line(aes(x = DATE, y = sales_scale, group = curves, color = curves),
alpha = 0.5, show.legend = F) +
scale_x_date() +
ggtitle("cluster4")
(p1 + p2)/(p3 + p4) # 拼图
```
  至此,曲线分离完毕,通过多种机器学习方法相结合,实现了不错的分离效果。
这里因为曲线数量有限,没有将上升趋势的和下降趋势的曲线进行分开处理,如果曲线数量非常大,最好分开处理,
否则曲线太多,并相互交叉,难以区分。<br/>
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
```
```{r, max.print = 18, row.print = 6, tidy=FALSE, message=FALSE, results="hold", warning=FALSE, cache=FALSE, fig.show="hold"}
```