---
title: "Stats 101A Predicting College Basketball Team Winning Percentages"
author: "Oscar O'Brien, Kaijun Cui, Donald Chung"
output:
pdf_document: default
html_notebook: default
---
```{r, echo=FALSE, message = FALSE}
library(ggplot2)
library(dplyr)
library(gridExtra)
library(car)
library(GGally)
library(corrplot)
train <- read.csv("CBDtrain.csv")
test <- read.csv("CBDtestNoY.csv")
```
| Item                        | Value                                    |
|----------------------------|-----------------------------------------|
| Name | Oscar O'Brien, Kaijun Cui, Donald Chung |
| SID | 205338241, 105543204, |
| Kaggle Nickname | Oscar, Donald, Kaijun - Lec1 |
| Kaggle Rank | NEED TO UPDATE |
| Kaggle R^2 | 0.81733 |
| Total Number of Predictors | 12 |
| Total Number of Betas | 24 |
| BIC | -9886.535 |
| Complexity Grade | 86 |
# Abstract
The purpose of this project is to create a multiple linear regression model to predict the winning proportion of college basketball teams based on game performance summary statistics[1]. This type of analysis is useful for situations like the annual March Madness tournament, where people around the world try to guess the outcome of a 64-team single-elimination tournament for large sums of money[2]. We used variable selection techniques and regression theory to select a subset of variables for our multiple linear regression model of winning proportion. In total, 12 predictors were used in our final model. Our team ("Oscar, Donald, Kaijun - Lec1") achieved a [RANK] on the class Kaggle competition. Our submitted model's adjusted $R^2$ on the training data is 0.82, and its $R^2$ on the testing data is 0.81733.
# Introduction
We aim to understand which predictors are relevant to determining a team's win proportion and, in doing so, to construct a multiple linear regression model that predicts it. We used data gathered from the 2013-2021 seasons of Division I college basketball in the United States. Our dataset has `r nrow(train) + nrow(test)` observations (2000 training and 1155 testing observations) and `r ncol(train)-1` predictors consisting of different statistics gathered over the season (e.g., free throws). The final regression model, submitted to a class Kaggle competition[3], was used to predict the win proportion for the testing dataset of 1155 observations. Below is the set of initial predictors provided with the dataset.
\newpage
| Predictor      | Type        |
|----------------|-------------|
| X500.Level | Categorical |
| ADJOE | Numerical |
| ADJDE | Numerical |
| EFG_O | Numerical |
| EFG_D | Numerical |
| TOR | Numerical |
| TORD | Numerical |
| ORB | Numerical |
| DRB | Numerical |
| FTR | Numerical |
| FTRD | Numerical |
| X2P_O | Numerical |
| X2P_D | Numerical |
| X3P_O | Numerical |
| X3P_D | Numerical |
| WAB | Numerical |
| YEAR | Categorical |
| NCAA | Categorical |
| Power.Rating | Categorical |
| Adjusted.Tempo | Numerical |
# Methodology
## Analyzing the response variable: Win proportion (W.P)
We first examine the response variable, W.P, to determine whether it is approximately normal and whether a transformation is necessary. The density histogram below shows that it is approximately normal, so a transformation is not immediately necessary.
```{r echo=FALSE, warning=FALSE, message=FALSE}
#summary(train$W.P)
ggplot(train, aes(x = W.P)) +
geom_histogram(aes(y = ..density.., color = 1)) +
geom_density(color = 2) +
labs(x = "Winning Percentage", title = "Density histogram of Winning Percentage")
#summary(powerTransform(W.P ~ 1, data = train))
```
## Analyzing the categorical variables
| Variable | Number of Categories |
|--------------|----------------------|
| X500.Level | 2 |
| YEAR | 9 |
| Power.Rating | 3 |
| NCAA | 2 |
```{r categorical variables, echo=FALSE, message=FALSE, warning=FALSE}
p1 <- ggplot(train, aes(x = X500.Level, y = W.P, fill = X500.Level)) +
geom_boxplot() +
theme(legend.position = "none")
p2 <- ggplot(train, aes(x = as.factor(YEAR), y = W.P, fill = as.factor(YEAR))) +
geom_boxplot() +
theme(legend.position = "none") +
labs(x = "YEAR")
p3 <- ggplot(train, aes(x = Power.Rating, y = W.P, fill = Power.Rating)) +
geom_boxplot() +
theme(legend.position = "none")
p4 <- ggplot(train, aes(x = NCAA, y = W.P, fill = NCAA)) +
geom_boxplot() +
theme(legend.position = "none")
grid.arrange(p1, p2, p3, p4, nrow = 2)
```
## Analyzing the numerical variables
We now inspect the numerical variables, beginning with the relationship between W.P and each numerical predictor. Below is a table of the numerical variables and their correlation coefficients with W.P (a short sketch of how these correlations can be computed follows the table):
| Variable | Correlation |
|----------------|-------------|
| WAB | 0.806 |
| ADJOE | 0.643 |
| ADJDE | -0.591 |
| EFG_O | 0.589 |
| X2P_O | 0.554 |
| EFG_D | -0.525 |
| X2P_D | -0.456 |
| X3P_O | 0.415 |
| X3P_D | -0.412 |
| TOR | -0.387 |
| DRB | -0.354 |
| FTRD | -0.266 |
| ORB | 0.245 |
| FTR | 0.147 |
| TORD | 0.146 |
| Adjusted.Tempo | -0.00671 |
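These correlations can be reproduced directly from the training data; the chunk below is a minimal sketch (not evaluated) rather than the code originally used to build the table.
```{r corr_sketch, eval=FALSE}
# Correlation of each numerical predictor with W.P, ordered by absolute value
numeric_vars <- c("WAB", "ADJOE", "ADJDE", "EFG_O", "X2P_O", "EFG_D", "X2P_D",
                  "X3P_O", "X3P_D", "TOR", "DRB", "FTRD", "ORB", "FTR", "TORD",
                  "Adjusted.Tempo")
cors <- sapply(numeric_vars, function(v) cor(train$W.P, train[[v]]))
round(cors[order(abs(cors), decreasing = TRUE)], 3)
```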
Additionally, here is a scatterplot matrix of the numerical variables:
```{r echo =FALSE, message =FALSE, warning=FALSE}
ggpairs(train, columns = c("WAB", "ADJOE", "ADJDE", "EFG_O", "X2P_O", "EFG_D", "X2P_D", "X3P_O"))
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggpairs(train, columns = c("X3P_D", "TOR", "DRB", "FTRD", "ORB", "FTR", "TORD", "Adjusted.Tempo"))
```
We split the scatterplot matrix into two plots because including all of the variables in a single plot makes it unreadable. Based on the density plots of the numerical variables, they all appear approximately normal, so transformations to fix normality are not needed.
## Analyzing Interactions
Below are the most prominent interactions that were found, along with their plots.
```{r echo=FALSE, message=FALSE, warning=FALSE}
p1 <- ggplot(train, aes(x = ADJOE, y = W.P, group = X500.Level, color = X500.Level)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p2 <- ggplot(train, aes(x = WAB, y = W.P, group = X500.Level, color = X500.Level)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p3 <- ggplot(train, aes(x = ADJOE, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p4 <- ggplot(train, aes(x = ADJDE, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p5 <- ggplot(train, aes(x = EFG_O, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p6 <- ggplot(train, aes(x = TOR, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p7 <- ggplot(train, aes(x = ORB, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
p8 <- ggplot(train, aes(x = WAB, y = W.P, group = Power.Rating, color = Power.Rating)) +
geom_point(alpha = .5) +
stat_smooth(method = "loess")
grid.arrange(p1, p2, p3, p4, nrow = 2)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
grid.arrange(p5, p6, p7, p8, nrow = 2)
```
## Variable Selection with BIC and AIC
### AIC Backstep Model
We used backstep (backward stepwise) AIC and BIC to verify that our model uses the best subset of predictors for predicting winning percentage. First, using backstep AIC, the summary statistics are shown below. This model overfits, with many predictors that are not significant, so it was also worth looking at BIC, which applies a larger penalty for additional predictors.
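The fitted models below are the results of those searches; the chunk that follows is only a minimal sketch of how such a backward search can be run with `step()`. The starting model name `full_interaction_model` is illustrative, and the chunk is not evaluated.
```{r backstep_sketch, eval=FALSE}
# Hypothetical starting point: all main effects plus all two-way interactions
full_interaction_model <- lm(
  W.P ~ (ADJOE + X500.Level + ADJDE + EFG_O + EFG_D + TOR + TORD + ORB + DRB +
           FTR + FTRD + X2P_O + X2P_D + X3P_O + X3P_D + WAB + Adjusted.Tempo +
           NCAA + Power.Rating)^2,
  data = train)
# Backward AIC search (default penalty k = 2)
step(full_interaction_model, direction = "backward", trace = FALSE)
# Backward BIC search (penalty k = log(n) removes more predictors)
step(full_interaction_model, direction = "backward",
     k = log(nrow(train)), trace = FALSE)
```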
```{r AIC model, echo=FALSE, warning=FALSE, message=FALSE}
m.a <- lm(formula = W.P ~ ADJOE + X500.Level + ADJDE + EFG_O + EFG_D +
TOR + TORD + ORB + DRB + FTR + FTRD + X2P_O + X2P_D + X3P_O +
X3P_D + WAB + Adjusted.Tempo + NCAA + Power.Rating + ADJOE:X500.Level +
X500.Level:ORB + X500.Level:X2P_O + WAB:Adjusted.Tempo +
X2P_O:NCAA + WAB:NCAA + ADJOE:Power.Rating + ADJDE:Power.Rating +
EFG_O:Power.Rating + TORD:Power.Rating + FTRD:Power.Rating +
X2P_O:Power.Rating + X3P_O:Power.Rating + ADJOE:ADJDE + ADJOE:EFG_O +
ADJOE:EFG_D + ADJOE:FTR + ADJOE:X3P_O + ADJOE:WAB + ADJOE:Adjusted.Tempo +
ADJDE:EFG_D + ADJDE:TOR + ADJDE:TORD + ADJDE:FTR + ADJDE:FTRD +
ADJDE:X2P_O + ADJDE:X3P_O + EFG_O:EFG_D + EFG_O:TOR + EFG_O:ORB +
EFG_O:X2P_O + EFG_O:WAB + EFG_O:Adjusted.Tempo + EFG_D:TOR +
EFG_D:TORD + EFG_D:ORB + EFG_D:FTR + EFG_D:X3P_D + EFG_D:WAB +
EFG_D:Adjusted.Tempo + TOR:ORB + TOR:FTR + TOR:X2P_D + TOR:X3P_O +
TOR:X3P_D + TORD:ORB + TORD:FTR + TORD:Adjusted.Tempo + ORB:DRB +
ORB:X2P_D + ORB:X3P_O + ORB:X3P_D + ORB:WAB + DRB:FTR + DRB:X3P_D +
FTR:FTRD + FTR:X3P_O + FTR:X3P_D + FTRD:X2P_O + X2P_O:X2P_D +
X2P_O:Adjusted.Tempo + X2P_D:WAB + X2P_D:Adjusted.Tempo +
X3P_O:WAB + X3P_O:Adjusted.Tempo + X3P_D:WAB, data = train)
#summary(m.a)
#aic_table <- summary(m.a)$coefficients
#aic_table <- round(aic_table, 7)
#write.csv(aic_table, "AIC model table.csv")
#aic_sum_stats <- data.frame("Observations" = 2000,
# "Residual Std Error" = round(summary(m.a)$sigma, 7),
# "R2" = round(summary(m.a)$r.squared, 4),
# "Adjusted R2" = round(summary(m.a)$adj.r.squared, 4))
#write.csv(aic_sum_stats, "AIC sum stats.csv", row.names = FALSE)
```
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|--------------------------|------------|------------|------------|------------|
| (Intercept) | 4.0054095 | 2.1281107 | 1.8821434 | 0.0599688 |
| ADJOE | -0.0296529 | 0.0242517 | -1.222714 | 0.2215889 |
| X500.LevelYES | 0.8503967 | 0.1425019 | 5.9676157 | 0 |
| ADJDE | 0.001397 | 0.023863 | 0.0585445 | 0.9533211 |
| EFG_O | 0.3101632 | 0.147381 | 2.1044991 | 0.0354656 |
| EFG_D | 0.1112445 | 0.1248427 | 0.8910774 | 0.3730001 |
| TOR | -0.0115156 | 0.0395842 | -0.2909152 | 0.7711478 |
| TORD | -0.0567102 | 0.0315993 | -1.7946645 | 0.0728656 |
| ORB | -0.0141021 | 0.0185733 | -0.7592676 | 0.4477864 |
| DRB | -0.0192112 | 0.0113711 | -1.6894781 | 0.0912914 |
| FTR | 0.0595799 | 0.0172358 | 3.4567589 | 0.0005587 |
| FTRD | 0.020957 | 0.0097514 | 2.1491347 | 0.0317493 |
| X2P_O | -0.1408757 | 0.0921257 | -1.5291687 | 0.1263886 |
| X2P_D | -0.1372335 | 0.0792554 | -1.7315354 | 0.0835181 |
| X3P_O | -0.2163702 | 0.0893662 | -2.421164 | 0.0155637 |
| X3P_D | -0.003158 | 0.0630261 | -0.0501067 | 0.9600426 |
| WAB | 0.0102746 | 0.0266412 | 0.3856637 | 0.6997888 |
| Adjusted.Tempo | -0.0774396 | 0.0206222 | -3.7551594 | 0.0001784 |
| NCAAYES | 0.2418339 | 0.1371613 | 1.7631353 | 0.0780379 |
| Power.RatingMEDIUM | 0.1456321 | 0.2179877 | 0.6680749 | 0.5041667 |
| Power.RatingSMALL | 0.121706 | 0.3128584 | 0.3890131 | 0.6973099 |
| ADJOE:X500.LevelYES | -0.0040354 | 0.0012765 | -3.1612462 | 0.0015957 |
| X500.LevelYES:ORB | -0.0044645 | 0.0015055 | -2.9654119 | 0.0030603 |
| X500.LevelYES:X2P_O | -0.0049215 | 0.0024139 | -2.0388103 | 0.0416067 |
| WAB:Adjusted.Tempo | -0.0008132 | 0.000237 | -3.4319834 | 0.000612 |
| X2P_O:NCAAYES | -0.0044018 | 0.0027186 | -1.6191494 | 0.1055806 |
| WAB:NCAAYES | -0.0064579 | 0.0035111 | -1.8392904 | 0.0660279 |
| ADJOE:Power.RatingMEDIUM | 0.0077266 | 0.0023544 | 3.2818154 | 0.00105 |
| ADJOE:Power.RatingSMALL | 0.0135774 | 0.0036802 | 3.6893417 | 0.0002311 |
| ADJDE:Power.RatingMEDIUM | -0.0055867 | 0.0022445 | -2.4891209 | 0.0128907 |
| ADJDE:Power.RatingSMALL | -0.0114453 | 0.003311 | -3.4567179 | 0.0005588 |
| EFG_O:Power.RatingMEDIUM | -0.0388028 | 0.0170988 | -2.2693336 | 0.0233594 |
| EFG_O:Power.RatingSMALL | -0.0072447 | 0.0194801 | -0.3719043 | 0.7100055 |
| TORD:Power.RatingMEDIUM | -0.0035402 | 0.0028418 | -1.245749 | 0.2130097 |
| TORD:Power.RatingSMALL | 0.0017696 | 0.0039573 | 0.4471646 | 0.6548071 |
| FTRD:Power.RatingMEDIUM | 0.0001431 | 0.001004 | 0.1424888 | 0.886709 |
| FTRD:Power.RatingSMALL | 0.0026394 | 0.0014524 | 1.8172472 | 0.0693362 |
| X2P_O:Power.RatingMEDIUM | 0.0209702 | 0.0109744 | 1.9108296 | 0.0561764 |
| X2P_O:Power.RatingSMALL | 0.0008631 | 0.0122795 | 0.070286 | 0.9439734 |
| X3P_O:Power.RatingMEDIUM | 0.0172804 | 0.0095623 | 1.807134 | 0.070899 |
| X3P_O:Power.RatingSMALL | -0.0034365 | 0.0111233 | -0.308949 | 0.7573941 |
| ADJOE:ADJDE | 0.0005443 | 0.0002115 | 2.573684 | 0.0101372 |
| ADJOE:EFG_O | -0.0010769 | 0.0004267 | -2.5237965 | 0.0116902 |
| ADJOE:EFG_D | -0.0006955 | 0.0003634 | -1.9141836 | 0.0557463 |
| ADJOE:FTR | -0.0002821 | 9.29e-05 | -3.0382161 | 0.0024121 |
| ADJOE:X3P_O | 0.0006831 | 0.0003683 | 1.8549083 | 0.0637636 |
| ADJOE:WAB | 0.0005034 | 0.0001843 | 2.7307173 | 0.0063779 |
| ADJOE:Adjusted.Tempo | 0.0005797 | 0.0002124 | 2.7292605 | 0.006406 |
| ADJDE:EFG_D | 0.0004603 | 0.0002748 | 1.6750338 | 0.0940916 |
| ADJDE:TOR | 0.0008949 | 0.0003323 | 2.6931111 | 0.0071409 |
| ADJDE:TORD | -0.0004576 | 0.0002605 | -1.756383 | 0.0791836 |
| ADJDE:FTR | -0.0004368 | 0.0001372 | -3.1840295 | 0.0014758 |
| ADJDE:FTRD | -0.0001997 | 8.59e-05 | -2.3255497 | 0.0201471 |
| ADJDE:X2P_O | -0.0007114 | 0.0002802 | -2.5389192 | 0.0111985 |
| ADJDE:X3P_O | -0.0004519 | 0.0002949 | -1.5324145 | 0.1255861 |
| EFG_O:EFG_D | 0.00115 | 0.0007134 | 1.6120714 | 0.107112 |
| EFG_O:TOR | -0.0023058 | 0.0006875 | -3.3538168 | 0.0008127 |
| EFG_O:ORB | 0.0007956 | 0.0003671 | 2.1672261 | 0.0303411 |
| EFG_O:X2P_O | 0.0007572 | 0.0003652 | 2.0731553 | 0.0382919 |
| EFG_O:WAB | 0.000799 | 0.0005159 | 1.5488461 | 0.1215847 |
| EFG_O:Adjusted.Tempo | -0.0032124 | 0.0021159 | -1.518268 | 0.1291127 |
| EFG_D:TOR | 0.0107441 | 0.0057452 | 1.8700998 | 0.0616231 |
| EFG_D:TORD | 0.0010386 | 0.0004686 | 2.2161359 | 0.0267999 |
| EFG_D:ORB | -0.0092134 | 0.0024238 | -3.8012935 | 0.0001485 |
| EFG_D:FTR | 0.0005194 | 0.0002623 | 1.9800135 | 0.0478456 |
| EFG_D:X3P_D | -0.0007649 | 0.0004184 | -1.8281322 | 0.0676859 |
| EFG_D:WAB | 0.0033181 | 0.0017785 | 1.8656946 | 0.0622376 |
| EFG_D:Adjusted.Tempo | -0.0010091 | 0.0005511 | -1.8310893 | 0.0672432 |
| TOR:ORB | 0.0004554 | 0.0002985 | 1.5258977 | 0.1272013 |
| TOR:FTR | -0.0004213 | 0.0002242 | -1.8793338 | 0.0603514 |
| TOR:X2P_D | -0.0076484 | 0.0037172 | -2.0575616 | 0.039768 |
| TOR:X3P_O | 0.0021799 | 0.0006917 | 3.1513187 | 0.0016506 |
| TOR:X3P_D | -0.0062106 | 0.0030603 | -2.029397 | 0.0425566 |
| TORD:ORB | 0.0006006 | 0.0002541 | 2.3638445 | 0.0181862 |
| TORD:FTR | -0.0005298 | 0.0002262 | -2.3425826 | 0.0192532 |
| TORD:Adjusted.Tempo | 0.0010631 | 0.0003168 | 3.3559722 | 0.0008064 |
| ORB:DRB | -0.0005371 | 0.0001685 | -3.1876099 | 0.0014578 |
| ORB:X2P_D | 0.0062219 | 0.00156 | 3.9884837 | 6.9e-05 |
| ORB:X3P_O | -0.0006556 | 0.0003232 | -2.0284462 | 0.0426535 |
| ORB:X3P_D | 0.0046478 | 0.0013067 | 3.5569292 | 0.0003844 |
| ORB:WAB | 0.0004191 | 0.0001502 | 2.7908954 | 0.0053086 |
| DRB:FTR | 0.0002372 | 0.00014 | 1.6940752 | 0.0904144 |
| DRB:X3P_D | 0.0005016 | 0.0002654 | 1.8897772 | 0.0589394 |
| FTR:FTRD | 0.0001313 | 6.28e-05 | 2.090745 | 0.0366831 |
| FTR:X3P_O | 0.0002679 | 0.0001747 | 1.5335136 | 0.1253153 |
| FTR:X3P_D | -0.0003892 | 0.0002174 | -1.7901454 | 0.0735892 |
| FTRD:X2P_O | -0.0001636 | 0.0001151 | -1.4215096 | 0.1553323 |
| X2P_O:X2P_D | 0.0006036 | 0.0003877 | 1.556969 | 0.1196439 |
| X2P_O:Adjusted.Tempo | 0.0020012 | 0.0013476 | 1.4850137 | 0.1377056 |
| X2P_D:WAB | -0.0024435 | 0.0011435 | -2.1369702 | 0.0327274 |
| X2P_D:Adjusted.Tempo | 0.0006605 | 0.0004331 | 1.5248549 | 0.1274612 |
| X3P_O:WAB | -0.0006295 | 0.0004536 | -1.387844 | 0.1653468 |
| X3P_O:Adjusted.Tempo | 0.0021271 | 0.0011871 | 1.7917886 | 0.0733254 |
| X3P_D:WAB | -0.0017679 | 0.0009611 | -1.839435 | 0.0660067 |
| Observations | Residual Std. Error | $R^2$ | Adjusted $R^2$ |
|--------------|---------------------|--------|----------------|
| 2000 | 0.0781057 | 0.8411 | 0.8334 |
Moving to BIC, the summary statistics are shown for the predictor subset selected from all two-variable interactions. As expected, BIC recommends far fewer predictors than AIC, and all interactions are significant except one. In total, the model uses 30 predictors with an adjusted $R^2$ of 0.8253.
### BIC Backstep Model
```{r BIC model, echo=FALSE, warning=FALSE, message=FALSE}
m.b <- lm(formula = W.P ~ ADJOE + X500.Level + ADJDE + EFG_O + EFG_D +
TOR + TORD + ORB + DRB + FTRD + X2P_D + WAB + Adjusted.Tempo +
ADJOE:X500.Level + X500.Level:DRB + WAB:Adjusted.Tempo +
ADJOE:Power.Rating + ADJDE:Power.Rating + FTRD:Power.Rating +
ADJOE:WAB + ADJOE:Adjusted.Tempo + ADJDE:FTRD + ADJDE:X2P_O +
TORD:Adjusted.Tempo + X2P_O:X2P_D + X2P_D:WAB, data = train)
#summary(m.b)
#bic_table <- summary(m.b)$coefficients
#bic_table <- round(bic_table, 7)
#write.csv(bic_table, "BIC model table.csv")
#bic_sum_stats <- data.frame("Observations" = 2000,
# "Residual Std Error" = round(summary(m.b)$sigma, 7),
# "R2" = round(summary(m.b)$r.squared, 4),
# "Adjusted R2" = round(summary(m.b)$adj.r.squared, 4))
#write.csv(bic_sum_stats, "BIC sum stats.csv", row.names = FALSE)
```
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|--------------------------|------------|------------|-------------|------------|
| (Intercept) | 4.0872589 | 1.2998372 | 3.1444392 | 0.0016889 |
| ADJOE | -0.04551 | 0.0100383 | -4.5336301 | 6.1e-06 |
| X500.LevelYES | 0.5315105 | 0.1109122 | 4.7921713 | 1.8e-06 |
| ADJDE | 0.0504118 | 0.0066518 | 7.5786294 | 0 |
| EFG_O | 0.0207607 | 0.0018756 | 11.068867 | 0 |
| EFG_D | -0.0193563 | 0.0019069 | -10.1508221 | 0 |
| TOR | -0.0126954 | 0.001484 | -8.5547497 | 0 |
| TORD | -0.0518984 | 0.0194335 | -2.6705597 | 0.0076348 |
| ORB | 0.0065183 | 0.0007191 | 9.0642784 | 0 |
| DRB | -0.0072228 | 0.001044 | -6.9182688 | 0 |
| FTRD | 0.0266694 | 0.0069422 | 3.8416415 | 0.0001261 |
| X2P_D | -0.0477884 | 0.0121936 | -3.91913 | 9.19e-05 |
| WAB | 0.0569002 | 0.0127716 | 4.4552274 | 8.9e-06 |
| Adjusted.Tempo | -0.0781789 | 0.0189494 | -4.1256725 | 3.85e-05 |
| ADJOE:X500.LevelYES | -0.0032 | 0.0010331 | -3.0974583 | 0.0019794 |
| X500.LevelYES:DRB | -0.0044306 | 0.0012171 | -3.640366 | 0.0002793 |
| WAB:Adjusted.Tempo | -0.0006059 | 0.000157 | -3.8594322 | 0.0001173 |
| ADJOE:Power.RatingMEDIUM | 0.0058284 | 0.0013739 | 4.242224 | 2.32e-05 |
| ADJOE:Power.RatingSMALL | 0.0108442 | 0.0017667 | 6.1380723 | 0 |
| ADJDE:Power.RatingMEDIUM | -0.0060468 | 0.0014051 | -4.3033445 | 1.76e-05 |
| ADJDE:Power.RatingSMALL | -0.0119556 | 0.0016891 | -7.0778966 | 0 |
| FTRD:Power.RatingMEDIUM | 4.8e-06 | 0.0008323 | 0.0057576 | 0.9954067 |
| FTRD:Power.RatingSMALL | 0.0035653 | 0.0010851 | 3.2857423 | 0.001035 |
| ADJOE:WAB | 0.0003449 | 7.4e-05 | 4.659524 | 3.4e-06 |
| ADJOE:Adjusted.Tempo | 0.0005362 | 0.0001474 | 3.6372545 | 0.0002826 |
| ADJDE:FTRD | -0.0002874 | 7.06e-05 | -4.0686417 | 4.92e-05 |
| ADJDE:X2P_O | -0.00047 | 0.0001148 | -4.0947164 | 4.4e-05 |
| TORD:Adjusted.Tempo | 0.0010532 | 0.0002858 | 3.6852747 | 0.0002346 |
| X2P_D:X2P_O | 0.0009131 | 0.0002388 | 3.8230342 | 0.0001359 |
| X2P_D:WAB | -0.0006124 | 0.0001166 | -5.2529786 | 2e-07 |
| Observations | Residual Std. Error | $R^2$ | Adjusted $R^2$ |
|--------------|--------------------|--------|-------------|
| 2000 | 0.0799765 | 0.8278 | 0.8253 |
Using the information from the backstep AIC and BIC searches, we created our final model, which uses fewer predictors without losing much adjusted $R^2$. The summary statistics are shown below. All predictors are significant with p-values less than .01. In total, there are 24 betas from 12 predictors.
```{r updated_model, echo = FALSE, message=FALSE, warning=FALSE}
updated_model <- lm(W.P ~ EFG_O + EFG_D +
TOR + TORD + ORB + DRB +
FTRD + WAB + X500.Level +
ADJOE:X500.Level + WAB:X500.Level +
ADJOE:Power.Rating + ADJDE:Power.Rating + EFG_O:Power.Rating +
TOR:Power.Rating + ORB:Power.Rating,
data = train)
#summary(updated_model)
#model_table <- summary(updated_model)$coefficients
#model_table <- round(model_table, 7)
#write.csv(model_table, "Model table.csv")
#model_sum_stats <- data.frame("Observations" = 2000,
# "Residual Std Error" = round(summary(updated_model)$sigma, 7),
# "R2" = round(summary(updated_model)$r.squared, 4),
# "Adjusted R2" = round(summary(updated_model)$adj.r.squared, 4))
#write.csv(model_sum_stats, "Model sum stats.csv", row.names = FALSE)
```
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|--------------------------|------------|------------|-------------|------------|
| (Intercept) | -0.0956912 | 0.108723 | -0.8801375 | 0.3788919 |
| EFG_O | 0.025108 | 0.0019409 | 12.9364965 | 0 |
| EFG_D | -0.0173218 | 0.0014257 | -12.1496496 | 0 |
| TOR | -0.0184541 | 0.0021265 | -8.678167 | 0 |
| TORD | 0.0188746 | 0.0013756 | 13.7213678 | 0 |
| ORB | 0.0092178 | 0.0010933 | 8.4313574 | 0 |
| DRB | -0.0091768 | 0.0008059 | -11.3866487 | 0 |
| FTRD | -0.0018203 | 0.0003588 | -5.0734105 | 4e-07 |
| WAB | 0.0181589 | 0.0010879 | 16.6924915 | 0 |
| X500.LevelYES | 0.5972203 | 0.1123004 | 5.3180617 | 1e-07 |
| X500.LevelNO:ADJOE | -0.0101787 | 0.001335 | -7.6246476 | 0 |
| X500.LevelYES:ADJOE | -0.0148462 | 0.0012437 | -11.9367636 | 0 |
| WAB:X500.LevelYES | 0.0050946 | 0.0013771 | 3.6994496 | 0.000222 |
| ADJOE:Power.RatingMEDIUM | 0.0070205 | 0.0016043 | 4.3759501 | 1.27e-05 |
| ADJOE:Power.RatingSMALL | 0.0109665 | 0.0017807 | 6.1585679 | 0 |
| Power.RatingLARGE:ADJDE | 0.0130708 | 0.0011956 | 10.9323658 | 0 |
| Power.RatingMEDIUM:ADJDE | 0.0114744 | 0.0013279 | 8.6407309 | 0 |
| Power.RatingSMALL:ADJDE | 0.009315 | 0.0011239 | 8.2883985 | 0 |
| EFG_O:Power.RatingMEDIUM | -0.0110918 | 0.0026461 | -4.1918241 | 2.89e-05 |
| EFG_O:Power.RatingSMALL | -0.016703 | 0.0031358 | -5.3264853 | 1e-07 |
| TOR:Power.RatingMEDIUM | 0.0070655 | 0.0026112 | 2.7059044 | 0.0068704 |
| TOR:Power.RatingSMALL | 0.0140951 | 0.002978 | 4.7330394 | 2.4e-06 |
| ORB:Power.RatingMEDIUM | -0.0045416 | 0.001465 | -3.1000676 | 0.0019621 |
| ORB:Power.RatingSMALL | -0.0055318 | 0.001617 | -3.4210292 | 0.0006365 |
| Observations | Residual Std. Error | $R^2$ | Adjusted $R^2$ |
|--------------|--------------------|--------|-------------|
| 2000 | 0.0811705 | 0.8221 | 0.82 |
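The BIC value reported in the header table at the top of this report can be obtained directly from a fitted model with the base-R `BIC()` function; a one-line sketch (not evaluated):
```{r bic_sketch, eval=FALSE}
# BIC of the final model; lower values indicate a better penalized fit
BIC(updated_model)
```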
To verify that the added variables in the BIC-recommended model are not significant, we conducted a partial F-test. We did not think a partial F-test was necessary for the AIC-recommended model because that model was overfitting, with many insignificant predictors. Interestingly, the partial F-test produced a very significant p-value (<.001), which implies that at least one of the betas removed from the BIC model is not 0. However, in practice, the 6 additional predictors in the BIC model only increased the adjusted $R^2$ value by 0.0053. As a result, we decided to accept that small loss in adjusted $R^2$ in favor of a simpler model.
### Partial F-test
```{r aic_f_test, echo=FALSE, message=FALSE, warning=FALSE}
#anova(updated_model, m.b)
#x <- anova(updated_model, m.b)
#anova_table <- data.frame("Res.Df" = x$Res.Df,
# "RSS" = round(x$RSS, 4),
# "Df" = x$Df,
# "Sum of Sq" = round(x$`Sum of Sq`, 4),
# "F" = round(x$`F`, 4),
# "Pr(>F)" = round(x$`Pr(>F)`, 7))
#write.csv(anova_table, "Anova table.csv", row.names = FALSE)
```
| Res Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|--------|---------|----|-----------|---------|--------|
| 1976 | 13.0192 | NA | NA | NA | NA |
| 1970 | 12.6006 | 6 | 0.4186 | 10.9069 | 0 |
## Verifying VIF
To check for multicollinearity between our predictors, we verified that the VIF for each predictor is less than five. As shown below, all non-interaction predictors have a VIF below the cutoff of five, although WAB comes close. Since all predictors have a VIF of less than five, our model is still valid.
```{r vif_check, echo=FALSE, message=FALSE, warning=FALSE}
updated_model_no_interaction <- lm(W.P ~ EFG_O + EFG_D +
TOR + TORD + ORB + DRB +
FTRD + WAB + X500.Level,
data = train)
#vif(updated_model_no_interaction)
#vif_table <- data.frame("Predictor" = names(vif(updated_model_no_interaction)),
# "VIF" = round(vif(updated_model_no_interaction), 5))
#write.csv(vif_table, "VIF table.csv", row.names = FALSE)
```
| Predictor | VIF |
|------------|---------|
| EFG_O | 2.10585 |
| EFG_D | 2.1256 |
| TOR | 1.63194 |
| TORD | 1.45436 |
| ORB | 1.56769 |
| DRB | 1.30315 |
| FTRD | 1.42125 |
| WAB | 4.80543 |
| X500.Level | 2.34033 |
## Verifying Diagnostics
The diagnostic plots below reinforce that our model is valid. The plots of the residuals and standardized residuals show that the relationship between the residuals and the fitted values is essentially flat, consistent with a linear mean function. Moreover, there are no patterns in the residuals, so the observations appear independent. The normal Q-Q plot shows that the standardized residuals are close to normal, with only slight deviation from the reference line at the far ends. Finally, the standardized residuals show that the variance is mostly constant. The observations with the highest and lowest fitted values seem to have slightly less variance, but there are fewer observations at these fitted values, and we show later that transforming the data does not improve this perceived variance problem.
Looking at the leverage plot, most of the leverage points are good leverage points, with only a few categorized as bad leverage points. More analysis of these points is given below.
```{r diagnostic_plots, echo=FALSE, message=FALSE, warning=FALSE}
diag_plot2 <- function(x) {
p1 <- ggplot(x, aes(x$fitted.values, x$residuals)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, color = "red") +
geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals", title = "Residual vs Fitted Plot")
p2 <- ggplot(x, aes(sample = rstandard(x))) +
stat_qq() +
stat_qq_line() +
labs(x = "Theoretical Quantiles", y = "Standardized Residuals", title = "Normal Q-Q")
p3 <- ggplot(x, aes(x$fitted.values, sqrt(abs(rstandard(x))))) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "loess", formula = y ~ x, na.rm = TRUE) +
labs(x = "Fitted Value", y = expression(sqrt("|Standardized Residuals|")),
title = "Scale-Location") +
geom_hline(yintercept = sqrt(2), color = "red", linetype = "dashed")
p4 <- ggplot(x, aes(hatvalues(x), rstandard(x))) +
geom_point(aes(size = cooks.distance(x)), na.rm = TRUE) +
scale_size_continuous("Cook's Distance", range = c(1,5)) +
labs(x = "Leverage", y = "Standardized Residuals",
title = "Residual vs Leverage Plot") +
geom_hline(yintercept = c(-2,2), color = "red", linetype = "dashed") +
geom_vline(xintercept = 2 * (nrow(summary(x)$coefficients)-1) / 2000, color = "blue", linetype = "dashed")
grid.arrange(p1, p2, p3, p4, nrow = 2)
}
diag_plot2(updated_model)
```
The table below shows that there are 86 good leverage points and 7 bad leverage points. One way we tried to handle the bad leverage points was to remove them from the data and retrain the model without them (a sketch of this refit follows the table), but this did not improve the model's accuracy on Kaggle at all. We also looked into transformations that might fix these bad leverage points, but they did not improve the model either. The transformation work is shown below.
### Bad Leverage Points
```{r bad_leverage, echo = FALSE, warning=FALSE, message=FALSE}
leverage <- ifelse(hatvalues(updated_model) > 2*23 / 2000, "Leverage", "Not Leverage")
outlier <- ifelse(abs(rstandard(updated_model)) > 2, "Outlier", "Not Outlier")
#table(leverage, outlier)
#write.csv(table(leverage, outlier), "leverage table.csv")
```
| | Not Outlier | Outlier |
|--------------|-------------|---------|
| Leverage | 86 | 7 |
| Not Leverage | 1828 | 79 |
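A minimal sketch of the refit described above, assuming the same leverage and outlier cutoffs used in the chunk that produced this table (not evaluated):
```{r refit_sketch, eval=FALSE}
# Drop the points flagged as both high leverage and outliers, then refit
bad <- hatvalues(updated_model) > 2 * 23 / 2000 &
  abs(rstandard(updated_model)) > 2
retrained_model <- update(updated_model, data = train[!bad, ])
summary(retrained_model)$adj.r.squared
```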
## Model Transformations
### Inverse Regression
Using only inverse regression on our current model, the recommended lambda is 1.105, but we would not use this value exactly, so as not to overfit. Rounding to the nearest common lambda value gives a lambda of 1, i.e., no transformation of y. The difference in RSS between lambda = 1.105 and lambda = 1 is less than 0.1, so inverse regression by itself recommends no transformation.
```{r only_inv_reg, echo=FALSE, warning=FALSE, message=FALSE}
#inverseResponsePlot(updated_model)
x <- inverseResponsePlot(updated_model)
#irp_table <- data.frame("lambda" = x$lambda,
# "RSS" = round(x$RSS, 6))
#write.table(irp_table, "irp table.csv", row.names = FALSE)
```
| lambda | RSS |
|------------------|-----------|
| 1.10495400573719 | 10.64304 |
| -1 | 60.061748 |
| 0 | 27.14729 |
| 1 | 10.703005 |
### Box-Cox Method
The output below shows the result of using a Box-Cox transformation on the strictly positive numerical predictors. Using the recommended transformations on the predictors yields a model with a worse adjusted $R^2$ value, and some predictors are no longer significant. Moreover, the diagnostic plots shown look nearly identical to those of the model before the transformations were applied. As a result, using Box-Cox alone did not improve our model at all.
```{r box_cox, echo=FALSE, warning=FALSE, message=FALSE}
#summary(powerTransform(cbind(EFG_O, EFG_D, TOR, TORD, ORB,
# DRB, FTRD, W.P)~1, data = train))
transformed_model <- lm(W.P ~ I(EFG_O^.75) + I(EFG_D^.75) +
I(TOR^.25) + I(TORD^.5) + I(ORB^1.25) + I(DRB^.75) +
I(FTRD^.25) + WAB + X500.Level +
ADJOE:Power.Rating + ADJDE:Power.Rating + EFG_O:Power.Rating
, data = train)
#summary(transformed_model)
diag_plot2(transformed_model)
#x <- summary(powerTransform(cbind(EFG_O, EFG_D, TOR, TORD, ORB,
# DRB, FTRD, W.P)~1, data = train))
#tm_table <- x$result
#tm_table <- round(tm_table, 7)
#write.csv(tm_table, "Transformed Model table.csv")
#tmodel_table <- summary(transformed_model)$coefficients
#tmodel_table <- round(tmodel_table, 7)
#write.csv(tmodel_table, "TModel table.csv")
#tmodel_sum_stats <- data.frame("Observations" = 2000,
# "Residual Std Error" = round(summary(transformed_model)$sigma, 7),
# "R2" = round(summary(transformed_model)$r.squared, 4),
# "Adjusted R2" = round(summary(transformed_model)$adj.r.squared, 4))
#write.csv(tmodel_sum_stats, "TModel sum stats.csv", row.names = FALSE)
```
| | Est Power | Rounded Pwr | Wald Lwr Bnd | Wald Upr Bnd |
|-------|-----------|-------------|--------------|--------------|
| EFG_O | 0.7272337 | 1 | 0.2594333 | 1.1950342 |
| EFG_D | 0.7641039 | 1 | 0.1883705 | 1.3398373 |
| TOR | 0.251128 | 0 | -0.0475478 | 0.5498039 |
| TORD | 0.4915995 | 0.5 | 0.2284788 | 0.7547202 |
| ORB | 1.1717593 | 1 | 0.9424139 | 1.4011047 |
| DRB | 0.8511817 | 1 | 0.5441797 | 1.1581837 |
| FTRD | 0.1330578 | 0 | -0.0632025 | 0.3293182 |
| W.P | 0.9506723 | 1 | 0.8968297 | 1.0045149 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|--------------------------|------------|------------|-------------|------------|
| (Intercept) | 2.8812825 | 1.0469778 | 2.7519997 | 0.0059772 |
| I(EFG_O^0.75) | -0.3622398 | 0.2218888 | -1.6325281 | 0.1027272 |
| I(EFG_D^0.75) | -0.0603354 | 0.0050809 | -11.8750106 | 0 |
| I(TOR^0.25) | -0.4581421 | 0.0548756 | -8.3487441 | 0 |
| I(TORD^0.5) | 0.1623535 | 0.0119577 | 13.5773053 | 0 |
| I(ORB^1.25) | 0.0021917 | 0.0002502 | 8.759291 | 0 |
| I(DRB^0.75) | -0.0289527 | 0.0025115 | -11.5281391 | 0 |
| I(FTRD^0.25) | -0.1036041 | 0.0210778 | -4.9153124 | 1e-06 |
| WAB | 0.0202778 | 0.0009226 | 21.97905 | 0 |
| X500.LevelYES | 0.0784163 | 0.0061986 | 12.6507388 | 0 |
| ADJOE:Power.RatingLARGE | -0.0107827 | 0.0010677 | -10.0993214 | 0 |
| ADJOE:Power.RatingMEDIUM | -0.0068074 | 0.0014122 | -4.8205709 | 1.5e-06 |
| ADJOE:Power.RatingSMALL | -0.0039867 | 0.0013997 | -2.8483141 | 0.0044405 |
| Power.RatingLARGE:ADJDE | 0.0122557 | 0.001109 | 11.0510611 | 0 |
| Power.RatingMEDIUM:ADJDE | 0.0111774 | 0.0013298 | 8.4050252 | 0 |
| Power.RatingSMALL:ADJDE | 0.0091612 | 0.0010993 | 8.3334426 | 0 |
| Power.RatingLARGE:EFG_O | 0.1236495 | 0.0620627 | 1.9923307 | 0.0464718 |
| Power.RatingMEDIUM:EFG_O | 0.1177503 | 0.0625302 | 1.8830941 | 0.059834 |
| Power.RatingSMALL:EFG_O | 0.1163107 | 0.0632313 | 1.8394495 | 0.0659987 |
| Observations | Residual Std. Error | $R^2$ | Adjusted $R^2$ |
|--------------|---------------------|--------|----------------|
| 2000 | 0.0819522 | 0.8182 | 0.8165 |
### Box-Cox and Inverse Regression Together
After applying the Box-Cox transformations to the predictors, we then checked whether transforming the Y variable with inverse regression would improve the model. However, the recommended lambda is 1.05, which rounds to 1. This means the Y variable should not be transformed to minimize RSS.
```{r both_transformations, echo = FALSE, warning=FALSE, message=FALSE}
#inverseResponsePlot(transformed_model)
x <- inverseResponsePlot(transformed_model)
#both_table <- data.frame("lambda" = x$lambda,
# "RSS" = round(x$RSS, 6))
#write.table(both_table, "both table.csv", row.names = FALSE)
```
| lambda | RSS |
|------------------|-----------|
| 1.05030763404994 | 10.871579 |
| -1 | 59.787248 |
| 0 | 26.783034 |
| 1 | 10.88582 |
As a result of the work above, we could not find a transformed version of our model with any noticeable improvement. Hence, our final model does not employ any transformations.
# Results
## Final Model
Having looked at the diagnostics, variable selection, and potential transformations above, we would like to reiterate our final model. This model produced a score of 0.81733 on Kaggle, and the model includes 24 betas from 12 predictors.
```{r final_model, echo=FALSE, message=FALSE, warning=FALSE}
#summary(updated_model)
# updated model tables created above
```
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|--------------------------|------------|------------|-------------|------------|
| (Intercept) | -0.0956912 | 0.108723 | -0.8801375 | 0.3788919 |
| EFG_O | 0.025108 | 0.0019409 | 12.9364965 | 0 |
| EFG_D | -0.0173218 | 0.0014257 | -12.1496496 | 0 |
| TOR | -0.0184541 | 0.0021265 | -8.678167 | 0 |
| TORD | 0.0188746 | 0.0013756 | 13.7213678 | 0 |
| ORB | 0.0092178 | 0.0010933 | 8.4313574 | 0 |
| DRB | -0.0091768 | 0.0008059 | -11.3866487 | 0 |
| FTRD | -0.0018203 | 0.0003588 | -5.0734105 | 4e-07 |
| WAB | 0.0181589 | 0.0010879 | 16.6924915 | 0 |
| X500.LevelYES | 0.5972203 | 0.1123004 | 5.3180617 | 1e-07 |
| X500.LevelNO:ADJOE | -0.0101787 | 0.001335 | -7.6246476 | 0 |
| X500.LevelYES:ADJOE | -0.0148462 | 0.0012437 | -11.9367636 | 0 |
| WAB:X500.LevelYES | 0.0050946 | 0.0013771 | 3.6994496 | 0.000222 |
| ADJOE:Power.RatingMEDIUM | 0.0070205 | 0.0016043 | 4.3759501 | 1.27e-05 |
| ADJOE:Power.RatingSMALL | 0.0109665 | 0.0017807 | 6.1585679 | 0 |
| Power.RatingLARGE:ADJDE | 0.0130708 | 0.0011956 | 10.9323658 | 0 |
| Power.RatingMEDIUM:ADJDE | 0.0114744 | 0.0013279 | 8.6407309 | 0 |
| Power.RatingSMALL:ADJDE | 0.009315 | 0.0011239 | 8.2883985 | 0 |
| EFG_O:Power.RatingMEDIUM | -0.0110918 | 0.0026461 | -4.1918241 | 2.89e-05 |
| EFG_O:Power.RatingSMALL | -0.016703 | 0.0031358 | -5.3264853 | 1e-07 |
| TOR:Power.RatingMEDIUM | 0.0070655 | 0.0026112 | 2.7059044 | 0.0068704 |
| TOR:Power.RatingSMALL | 0.0140951 | 0.002978 | 4.7330394 | 2.4e-06 |
| ORB:Power.RatingMEDIUM | -0.0045416 | 0.001465 | -3.1000676 | 0.0019621 |
| ORB:Power.RatingSMALL | -0.0055318 | 0.001617 | -3.4210292 | 0.0006365 |
| Observations | Residual Std. Error | $R^2$ | Adjusted $R^2$ |
|--------------|---------------------|--------|----------------|
| 2000 | 0.0811705 | 0.8221 | 0.82 |
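For completeness, a minimal sketch of how test-set predictions could be generated from this model for a Kaggle submission is shown below; the submission column names are assumptions, since the exact format required by the competition is not reproduced here.
```{r prediction_sketch, eval=FALSE}
# Predict winning proportion for the held-out test set and write a submission
# file; the "Ob" and "W.P" column names are assumed, not taken from the
# competition specification.
test_pred <- predict(updated_model, newdata = test)
submission <- data.frame(Ob = seq_len(nrow(test)), W.P = test_pred)
write.csv(submission, "submission.csv", row.names = FALSE)
```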
# Discussion
Below we further investigate the validity of our final model. One feature we considered while testing our model was creating new variables from the data we were given. However, even without transformations, the numerical variables are all quite normal. Since we did not see any skewed data or irregular shapes in the ggpairs density plots above, we were not able to create any new variables that improved our model.
## Marginal Model Plots
To further verify our model, we plotted the marginal model plots to make sure the trends in the model and in the data line up. The results show that they line up extremely closely, which is the goal. This builds on the earlier analysis of the variables showing that the relationships between W.P and the individual predictors are linear, so polynomial terms would not improve the predictive capabilities of the model.
```{r mmps, echo=FALSE, warning=FALSE, message=FALSE}
model_single_numerical <- lm(W.P ~ EFG_O + EFG_D + TOR + TORD + ORB + DRB + FTRD +
WAB, data = train)
mmps(model_single_numerical)
```
## Added Variable Plots
Shown below are the added variable plots for our final model. Ideally, all of these plots should show a non-horizontal slope, because that means each beta contributes to the model while controlling for the influence of the other predictors. None of the plots below shows a perfectly horizontal blue line. Given that all of our predictors are significant, and that removing these betas quickly started to reduce our model's adjusted $R^2$, we kept them.
```{r added variable plots, message=FALSE, echo = FALSE, warning= FALSE}
avPlots(updated_model)
```
# Limitations and Conclusions
One potential concern with our model is multicollinearity. In the VIF section above, the model with all of the predictors and no interaction terms passed the standard cutoff of five. However, WAB's VIF was 4.80543, which is close to five. Since $R^2_j = 1 - 1/\mathrm{VIF}_j$, this means that if WAB were regressed on the other predictors in our final model, that regression would have an $R^2$ value of approximately 0.79170. This is a concern because severe multicollinearity inflates the standard errors of the beta estimates, making them unstable. Fortunately, the VIF value is still below the cutoff of five, and the variable is significant in the model, meaning it still helps predict winning percentage.
Moving to other diagnostics, a slight concern with the diagnostic plots is the minor increase in residual variance in the middle part of the data. Overall, the standardized residuals do not show much of a difference, and we tried Box-Cox and inverse response plot transformations to possibly improve this, but the transformations were unsuccessful, as shown above. The only other note from the diagnostic plots is the seven bad leverage points. We tried removing them, but new bad leverage points appeared; we tried transformations to improve the data, but these options did not help either. While it is not ideal to have bad leverage points, we have 86 good leverage points, which is far more than the number of bad ones. These good leverage points serve to increase $R^2$ and do not have a negative effect on the beta coefficients[4].
A final limitation of our model comes from the partial F-test comparing our current model to the BIC-recommended model. The significant p-value indicated that some of the betas dropped from the BIC model were not equal to 0, but we decided to proceed with six fewer betas because the cost was only 0.0053 in the model's adjusted $R^2$. Picking the best subset of predictors is not an exact science, and it requires some human judgment to weigh the benefits and costs[5].
Overall, we are satisfied with the model that we produced to predict winning percentage in college basketball based on game statistics. Approximately 82% of the variation in winning percentage is accounted for by the model, and our model is backed by a respectable Kaggle ranking in the class. Further evidence that our model is reasonable is that it performs about the same on the training and testing data, implying that it is not overfitting to the training data; the difference between the adjusted $R^2$ on the training data and the Kaggle score is 0.00267. One thing that surprised us in this project was how little transformations helped, which we attribute to how close to normal the variables already were.
# References
[1] Almohalwas, A., 2022. STAT 101 A Winter 2022 Kaggle Competition.
[2] ESPN, 2022. NCAA Tournament Bracket Challenge 2022 | ESPN. [online] ESPN. Available at: <https://fantasy.espn.com/tournament-challenge-bracket/2022/en/story?pageName=tcmen\rules> [Accessed 18 March 2022].
[3] Almohalwas, A., 2022. Predicting Winning Proportions | Kaggle. [online] Kaggle.com. Available at: <https://www.kaggle.com/c/predicting-winning-proportions/submissions> [Accessed 18 March 2022].
[4] Almohalwas, A., 2022. chapter 3 scanned notes and Examples updated.
[5] Sheather, S., 2009. A Modern Approach to Regression with R. Springer, pp.232-233.