# Model Diagnostics
```{r, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center")
```
> "Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in."
>
> --- **Isaac Asimov**
After reading this chapter you will be able to:
- Understand the assumptions of a regression model.
- Assess regression model assumptions using visualizations and tests.
- Understand leverage, outliers, and influential points.
- Be able to identify unusual observations in regression models.
## Model Assumptions
Recall the multiple linear regression model that we have defined.
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)} + \epsilon_i, \qquad i = 1, 2, \ldots, n.
\]
Using matrix notation, this model can be written much more succinctly as
\[
Y = X \beta + \epsilon.
\]
Given data, we found the estimates for the $\beta$ parameters using
\[
\hat{\beta} = \left( X^\top X \right)^{-1}X^\top y.
\]
We then noted that these estimates had mean
\[
\text{E}[\hat{\beta}] = \beta,
\]
and variance
\[
\text{Var}[\hat{\beta}] = \sigma^2 \left( X^\top X \right)^{-1}.
\]
In particular, an individual parameter, say $\hat{\beta}_j$, has a normal distribution
\[
\hat{\beta}_j \sim N\left(\beta_j, \sigma^2 C_{jj} \right)
\]
where $C$ is the matrix defined as
\[
C = \left(X^\top X\right)^{-1}.
\]
We then used this fact to define
\[
\frac{\hat{\beta}_j - \beta_j}{s_e \sqrt{C_{jj}}} \sim t_{n-p},
\]
which we used to perform hypothesis testing.
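As a quick numerical check of these formulas, we can compute $\hat{\beta}$, the standard errors $s_e \sqrt{C_{jj}}$, and the $t$ statistics directly with matrix algebra and compare them to `summary()`. (A minimal sketch; the simulated data `demo_x1`, `demo_x2`, and `demo_y` is created here purely for illustration.)
```{r}
# a minimal sketch: verify the matrix formulas against lm() on simulated data
set.seed(1)
demo_n  = 50
demo_x1 = runif(demo_n)
demo_x2 = runif(demo_n)
demo_y  = 2 + 3 * demo_x1 - 1 * demo_x2 + rnorm(demo_n)
demo_X  = cbind(1, demo_x1, demo_x2)                # design matrix, p = 3
demo_C  = solve(t(demo_X) %*% demo_X)
beta_hat = demo_C %*% t(demo_X) %*% demo_y          # (X'X)^{-1} X'y
s_e = sqrt(sum((demo_y - demo_X %*% beta_hat) ^ 2) / (demo_n - ncol(demo_X)))
se_beta = s_e * sqrt(diag(demo_C))                  # standard errors of estimates
cbind(beta_hat, se_beta, beta_hat / se_beta)        # estimate, SE, t statistic
summary(lm(demo_y ~ demo_x1 + demo_x2))$coefficients # first three columns match
```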
So far we have looked at various metrics such as RMSE, RSE, and $R^2$ to determine how well our model fits our data. Each of these in some way considers the expression
\[
\sum_{i = 1}^n (y_i - \hat{y}_i)^2.
\]
So, essentially, each of these looks at how close the data points are to the model. However, is that all we care about?
- It could be that the errors are made in a systematic way, which would mean that our model is misspecified. We may need additional interaction terms or polynomial terms, which we will see later.
- It is also possible that at a particular set of predictor values, the errors are very small, but at a different set of predictor values, the errors are large.
- Perhaps most of the errors are very small, but some are very large. This would suggest that the errors do not follow a normal distribution.
Are these issues that we care about? If all we would like to do is predict, possibly not, since we would only care about the size of our errors. However, if we would like to perform inference, for example to determine if a particular predictor is important, we care a great deal. All of the distributional results, such as a $t$-test for a single predictor, are derived under the assumptions of our model.
Technically, the assumptions of the model are encoded directly in a model statement such as,
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)} + \epsilon_i
\]
where $\epsilon_i \sim N(0, \sigma^2).$
Often, the **assumptions of linear regression** are stated as,
- **L**inearity: the response can be written as a linear combination of the predictors. (With noise about this true linear relationship.)
- **I**ndependence: the errors are independent.
- **N**ormality: the distribution of the errors should follow a normal distribution.
- **E**qual Variance: the error variance is the same at any set of predictor values.
The linearity assumption is encoded as
\[
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i(p-1)},
\]
while the remaining three are all encoded in
\[
\epsilon_i \sim N(0, \sigma^2),
\]
since the $\epsilon_i$ are $iid$ normal random variables with constant variance.
If these assumptions are met, great! We can perform inference, **and it is valid**. If these assumptions are *not* met, we can still "perform" a $t$-test using `R`, but the results are **not valid**. The distributions of the parameter estimates will not be what we expect. Hypothesis tests will then reject, or fail to reject, incorrectly. Essentially, **garbage in, garbage out.**
## Checking Assumptions
We'll now look at a number of tools for checking the assumptions of a linear model. To test these tools, we'll use data simulated from three models:
\[
\text{Model 1:} \quad Y = 3 + 5x + \epsilon, \quad \epsilon \sim N(0, 1)
\]
\[
\text{Model 2:} \quad Y = 3 + 5x + \epsilon, \quad \epsilon \sim N(0, x^2)
\]
\[
\text{Model 3:} \quad Y = 3 + 5x^2 + \epsilon, \quad \epsilon \sim N(0, 25)
\]
```{r}
sim_1 = function(sample_size = 500) {
x = runif(n = sample_size) * 5
y = 3 + 5 * x + rnorm(n = sample_size, mean = 0, sd = 1)
data.frame(x, y)
}
sim_2 = function(sample_size = 500) {
x = runif(n = sample_size) * 5
y = 3 + 5 * x + rnorm(n = sample_size, mean = 0, sd = x)
data.frame(x, y)
}
sim_3 = function(sample_size = 500) {
x = runif(n = sample_size) * 5
y = 3 + 5 * x ^ 2 + rnorm(n = sample_size, mean = 0, sd = 5)
data.frame(x, y)
}
```
### Fitted versus Residuals Plot
Probably our most useful tool will be a **Fitted versus Residuals Plot**. It will be useful for checking both the **linearity** and **constant variance** assumptions.
Data generated from Model 1 above should not show any signs of violating assumptions, so we'll use this to see what a good fitted versus residuals plot should look like. First, we'll simulate observations from this model.
```{r}
set.seed(42)
sim_data_1 = sim_1()
head(sim_data_1)
```
We then fit the model and add the fitted line to a scatterplot.
```{r}
plot(y ~ x, data = sim_data_1, col = "grey", pch = 20,
main = "Data from Model 1")
fit_1 = lm(y ~ x, data = sim_data_1)
abline(fit_1, col = "darkorange", lwd = 3)
```
We now plot a fitted versus residuals plot. Note, this is residuals on the $y$-axis despite the ordering in the name. Sometimes you will see this called a residuals versus fitted, or residuals versus predicted plot.
```{r}
plot(fitted(fit_1), resid(fit_1), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Data from Model 1")
abline(h = 0, col = "darkorange", lwd = 2)
```
We should look for two things in this plot.
- At any fitted value, the mean of the residuals should be roughly 0. If this is the case, the *linearity* assumption is valid. For this reason, we generally add a horizontal line at $y = 0$ to emphasize this point.
- At every fitted value, the spread of the residuals should be roughly the same. If this is the case, the *constant variance* assumption is valid.
Here we see this is the case for both.
To get a better idea of how a fitted versus residuals plot can be useful, we will simulate from models with violated assumptions.
Model 2 is an example of non-constant variance. In this case, the variance is larger for larger values of the predictor variable $x$.
```{r}
set.seed(42)
sim_data_2 = sim_2()
fit_2 = lm(y ~ x, data = sim_data_2)
plot(y ~ x, data = sim_data_2, col = "grey", pch = 20,
main = "Data from Model 2")
abline(fit_2, col = "darkorange", lwd = 3)
```
This is actually rather easy to see here by adding the fitted line to a scatterplot, since we are only performing simple linear regression. With multiple regression, a fitted versus residuals plot is a necessity, since adding a fitted regression to a scatterplot isn't generally possible.
```{r}
plot(fitted(fit_2), resid(fit_2), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Data from Model 2")
abline(h = 0, col = "darkorange", lwd = 2)
```
On the fitted versus residuals plot, we see two things very clearly. For any fitted value, the residuals seem roughly centered at 0. This is good! The linearity assumption is not violated. However, we also see very clearly, that for larger fitted values, the spread of the residuals is larger. This is bad! The constant variance assumption is violated here.
Now we will demonstrate a model which does not meet the linearity assumption. Model 3 is an example of a model where $Y$ is not a linear combination of the predictors. In this case, the predictor is $x$, but the model uses $x^2$. (We'll see later that this is something that a "linear" model can deal with. The fix is simple: just make $x^2$ a predictor!)
```{r}
set.seed(42)
sim_data_3 = sim_3()
fit_3 = lm(y ~ x, data = sim_data_3)
plot(y ~ x, data = sim_data_3, col = "grey", pch = 20,
main = "Data from Model 3")
abline(fit_3, col = "darkorange", lwd = 3)
```
Again, this is rather clear on the scatterplot, but again, we wouldn't be able to check this plot for multiple regression.
```{r}
plot(fitted(fit_3), resid(fit_3), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Data from Model 3")
abline(h = 0, col = "darkorange", lwd = 2)
```
This time on the fitted versus residuals plot, for any fitted value, the spread of the residuals is about the same. However, they are not even close to centered at zero! At small and large fitted values the model is underestimating, while at medium fitted values, the model is overestimating. These are systematic errors, not random noise. So the constant variance assumption is met, but the linearity assumption is violated. The form of our model is simply wrong. We're trying to fit a line to a curve!
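As a preview of that fix, refitting the data from Model 3 with a squared term removes the pattern in the residuals. (A quick sketch; transformations of this sort are covered properly later.)
```{r}
# refit with a quadratic term; the residuals are now centered around zero
fit_3_quad = lm(y ~ x + I(x ^ 2), data = sim_data_3)
plot(fitted(fit_3_quad), resid(fit_3_quad), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Data from Model 3, Quadratic Fit")
abline(h = 0, col = "darkorange", lwd = 2)
```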
### Breusch-Pagan Test
Constant variance is often called **homoscedasticity**. Conversely, non-constant variance is called **heteroscedasticity**. We've seen how we can use a fitted versus residuals plot to look for these attributes.
While a fitted versus residuals plot can give us an idea about homoscedasticity, sometimes we would prefer a more formal test. There are many tests for constant variance, but here we will present one, the [**Breusch-Pagan Test**](https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test){target="_blank"}. The exact details of the test will be omitted here, but importantly the null and alternative can be considered to be,
- $H_0$: Homoscedasticity. The errors have constant variance about the true model.
- $H_1$: Heteroscedasticity. The errors have non-constant variance about the true model.
Isn't that convenient? A test that will specifically test the **constant variance** assumption.
The Breusch-Pagan test is not available in base `R`; however, the `bptest()` function in the `lmtest` package implements it.
```{r, message = FALSE, warning = FALSE}
#install.packages("lmtest")
library(lmtest)
```
Let's try it on the three models we fit above. Recall,
- `fit_1` had no violation of assumptions,
- `fit_2` violated the constant variance assumption, but not linearity,
- `fit_3` violated linearity, but not constant variance.
```{r}
bptest(fit_1)
```
For `fit_1` we see a large p-value, so we do not reject the null of homoscedasticity, which is what we would expect.
```{r}
bptest(fit_2)
```
For `fit_2` we see a small p-value, so we reject the null of homoscedasticity. The constant variance assumption is violated. This matches our findings with a fitted versus residuals plot.
```{r}
bptest(fit_3)
```
Lastly, for `fit_3` we again see a large p-value, so we do not reject the null of homoscedasticity, which matches our findings with a fitted versus residuals plot.
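To take a little of the mystery out of the test, the studentized version that `bptest()` performs by default can be computed by hand: regress the squared residuals on the original predictors and compare $n \cdot R^2$ to a $\chi^2$ distribution. (A rough sketch for `fit_2`; see the Wikipedia link above for the exact details.)
```{r}
# a rough sketch of the (studentized) Breusch-Pagan statistic for fit_2
bp_aux  = lm(resid(fit_2) ^ 2 ~ x, data = sim_data_2)  # auxiliary regression
bp_stat = nrow(sim_data_2) * summary(bp_aux)$r.squared # n * R^2
c(statistic = bp_stat,
  p_value   = pchisq(bp_stat, df = 1, lower.tail = FALSE))
```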
### Histograms
We have a number of tools for assessing the normality assumption. The most obvious would be to make a histogram of the residuals. If it appears roughly normal, then we'll believe the errors could truly be normal.
```{r, fig.height=5, fig.width=15}
par(mfrow = c(1, 3))
hist(resid(fit_1),
xlab = "Residuals",
main = "Histogram of Residuals, fit_1",
col = "darkorange",
border = "dodgerblue",
breaks = 20)
hist(resid(fit_2),
xlab = "Residuals",
main = "Histogram of Residuals, fit_2",
col = "darkorange",
border = "dodgerblue",
breaks = 20)
hist(resid(fit_3),
xlab = "Residuals",
main = "Histogram of Residuals, fit_3",
col = "darkorange",
border = "dodgerblue",
breaks = 20)
```
Above are histograms for each of the three regression models we have been considering. Notice that the first, for `fit_1`, appears very normal, while the third, for `fit_3`, appears very non-normal. However, `fit_2` is not as clear: it has a rough bell shape, but also a very sharp peak. For this reason we will usually use more powerful tools, such as **Q-Q plots** and the **Shapiro-Wilk test**, for assessing the normality of errors.
### Q-Q Plots
Another visual method for assessing the normality of errors, which is more powerful than a histogram, is a normal quantile-quantile plot, or **Q-Q plot** for short.
In `R` these are very easy to make. The `qqnorm()` function plots the points, and the `qqline()` function adds the necessary line. We create a Q-Q plot for the residuals of `fit_1` to check if the errors could truly be normally distributed.
```{r}
qqnorm(resid(fit_1), main = "Normal Q-Q Plot, fit_1", col = "darkgrey")
qqline(resid(fit_1), col = "dodgerblue", lwd = 2)
```
In short, if the points of the plot do not closely follow a straight line, this would suggest that the data do not come from a normal distribution.
The calculations required to create the plot vary depending on the implementation, but essentially the $y$-axis is the sorted data (observed, or sample quantiles), and the $x$-axis is the values we would expect if the data did come from a normal distribution (theoretical quantiles).
The [Wikipedia page for Normal probability plots](http://en.wikipedia.org/wiki/Normal_probability_plot){target="_blank"} gives details on how this is implemented in `R` if you are interested.
Also, to get a better idea of how Q-Q plots work, here is a quick function which creates a Q-Q plot:
```{r}
qq_plot = function(e) {
n = length(e)
normal_quantiles = qnorm(((1:n - 0.5) / n))
# normal_quantiles = qnorm(((1:n) / (n + 1)))
# plot theoretical versus observed quantiles
plot(normal_quantiles, sort(e),
xlab = c("Theoretical Quantiles"),
ylab = c("Sample Quantiles"),
col = "darkgrey")
title("Normal Q-Q Plot")
# calculate line through the first and third quartiles
slope = (quantile(e, 0.75) - quantile(e, 0.25)) / (qnorm(0.75) - qnorm(0.25))
intercept = quantile(e, 0.25) - slope * qnorm(0.25)
# add to existing plot
abline(intercept, slope, lty = 2, lwd = 2, col = "dodgerblue")
}
```
We can then verify that it is essentially equivalent to using `qqnorm()` and `qqline()` in `R`.
```{r, fig.height = 5, fig.width = 10}
set.seed(420)
x = rnorm(100, mean = 0 , sd = 1)
par(mfrow = c(1, 2))
qqnorm(x, col = "darkgrey")
qqline(x, lty = 2, lwd = 2, col = "dodgerblue")
qq_plot(x)
```
To get a better idea of what "close to the line" means, we perform a number of simulations, and create Q-Q plots.
First we simulate data from a normal distribution with different sample sizes, and each time create a Q-Q plot.
```{r, fig.height = 4, fig.width = 12}
par(mfrow = c(1, 3))
set.seed(420)
qq_plot(rnorm(10))
qq_plot(rnorm(25))
qq_plot(rnorm(100))
```
Since this data **is** sampled from a normal distribution, these are all, by definition, good Q-Q plots. The points are "close to the line" and we would conclude that this data could have been sampled from a normal distribution. Notice in the first plot, one point is *somewhat* far from the line, but just one point, in combination with the small sample size, is not enough to make us worried. We see with the large sample size, all of the points are rather close to the line.
Next, we simulate data from a $t$ distribution with a small number of degrees of freedom, for different sample sizes.
```{r, fig.height = 4, fig.width = 12}
par(mfrow = c(1, 3))
set.seed(420)
qq_plot(rt(10, df = 4))
qq_plot(rt(25, df = 4))
qq_plot(rt(100, df = 4))
```
Recall that as the degrees of freedom for a $t$ distribution become larger, the distribution becomes more and more similar to a normal. Here, using 4 degrees of freedom, we have a distribution that is somewhat normal: it is symmetric and roughly bell-shaped, but it has "fat tails." This presents itself clearly in the third panel. While many of the points are close to the line, at the edges there are large discrepancies. This indicates that the values are too small (negative) or too large (positive) compared to what we would expect for a normal distribution. So, for the sample size of `100`, we would conclude that the normality assumption is violated (if these were the residuals of a model). For sample sizes of `10` and `25` we may be suspicious, but not entirely confident. Reading Q-Q plots is a bit of an art, not completely a science.
Next, we simulate data from an exponential distribution.
```{r, fig.height = 4, fig.width = 12}
par(mfrow = c(1, 3))
set.seed(420)
qq_plot(rexp(10))
qq_plot(rexp(25))
qq_plot(rexp(100))
```
This distribution is not very similar to a normal, so in all three cases we see points that are far from the line, and we would conclude that the normality assumption is violated.
For a better understanding of which Q-Q plots are "good," repeat the simulations above a number of times (without setting the seed) and pay attention to the differences between those that are simulated from normal, and those that are not. Also consider different sample sizes and distribution parameters.
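For instance, a quick sketch that draws nine Q-Q plots, each from a fresh normal sample of size 25, shows how much perfectly "good" plots can wobble around the line.
```{r, fig.height = 12, fig.width = 12}
# nine Q-Q plots of normal samples; all of these are "good" by definition
par(mfrow = c(3, 3))
for (i in 1:9) {
  qq_plot(rnorm(25))
}
```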
Returning to our three regressions, recall,
- `fit_1` had no violation of assumptions,
- `fit_2` violated the constant variance assumption, but not linearity,
- `fit_3` violated linearity, but not constant variance.
We'll now create a Q-Q plot for each to assess normality of errors.
```{r}
qqnorm(resid(fit_1), main = "Normal Q-Q Plot, fit_1", col = "darkgrey")
qqline(resid(fit_1), col = "dodgerblue", lwd = 2)
```
For `fit_1`, we have a near perfect Q-Q plot. We would believe the errors follow a normal distribution.
```{r}
qqnorm(resid(fit_2), main = "Normal Q-Q Plot, fit_2", col = "darkgrey")
qqline(resid(fit_2), col = "dodgerblue", lwd = 2)
```
For `fit_2`, we have a suspect Q-Q plot. We would probably **not** believe the errors follow a normal distribution.
```{r}
qqnorm(resid(fit_3), main = "Normal Q-Q Plot, fit_3", col = "darkgrey")
qqline(resid(fit_3), col = "dodgerblue", lwd = 2)
```
Lastly, for `fit_3`, we again have a suspect Q-Q plot. We would probably **not** believe the errors follow a normal distribution.
### Shapiro-Wilk Test
Histograms and Q-Q plots give a nice visual representation of the residual distribution; however, if we are interested in formal testing, there are a number of options available. A commonly used test is the **Shapiro–Wilk test**, which is implemented in `R`.
```{r}
set.seed(42)
shapiro.test(rnorm(25))
shapiro.test(rexp(25))
```
This gives us the value of the test statistic and its p-value. The null hypothesis is that the data were sampled from a normal distribution, thus a small p-value is evidence against normality; it would be unlikely to see data like this if the errors were truly normal.
For details, see: [Wikipedia: Shapiro–Wilk test.](https://en.wikipedia.org/wiki/Shapiro-Wilk_test){target="_blank"}
In the above examples, we fail to reject for the data sampled from a normal distribution, and reject for the non-normal data, for any reasonable $\alpha$.
Returning again to `fit_1`, `fit_2`, and `fit_3`, we see that running `shapiro.test()` on the residuals of each returns results that match our decisions based on the Q-Q plots.
```{r}
shapiro.test(resid(fit_1))
```
```{r}
shapiro.test(resid(fit_2))
```
```{r}
shapiro.test(resid(fit_3))
```
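If we wanted to check several models at once, the p-values can also be extracted programmatically from the `htest` objects that `shapiro.test()` returns, for example:
```{r}
# extract the Shapiro-Wilk p-value for each of the three fits
sapply(list(fit_1 = fit_1, fit_2 = fit_2, fit_3 = fit_3),
       function(fit) shapiro.test(resid(fit))$p.value)
```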
## Unusual Observations
In addition to checking the assumptions of regression, we also look for any "unusual observations" in the data. Often a small number of data points can have an extremely large influence on a regression, sometimes so much so that the regression assumptions are violated as a result of these points.
The following three plots are inspired by an example from [Linear Models with R](http://www.maths.bath.ac.uk/~jjf23/LMR/){target="_blank"}.
```{r unusual_obs_plot, fig.height = 5, fig.width = 15}
par(mfrow = c(1, 3))
set.seed(42)
ex_data = data.frame(x = 1:10,
y = 10:1 + rnorm(n = 10))
ex_model = lm(y ~ x, data = ex_data)
# low leverage, large residual, small influence
point_1 = c(5.4, 11)
ex_data_1 = rbind(ex_data, point_1)
model_1 = lm(y ~ x, data = ex_data_1)
plot(y ~ x, data = ex_data_1, cex = 2, pch = 20, col = "grey",
main = "Low Leverage, Large Residual, Small Influence")
points(x = point_1[1], y = point_1[2], pch = 1, cex = 4, col = "black", lwd = 2)
abline(ex_model, col = "dodgerblue", lwd = 2)
abline(model_1, lty = 2, col = "darkorange", lwd = 2)
legend("bottomleft", c("Original Data", "Added Point"),
lty = c(1, 2), col = c("dodgerblue", "darkorange"))
# high leverage, small residual, small influence
point_2 = c(18, -5.7)
ex_data_2 = rbind(ex_data, point_2)
model_2 = lm(y ~ x, data = ex_data_2)
plot(y ~ x, data = ex_data_2, cex = 2, pch = 20, col = "grey",
main = "High Leverage, Small Residual, Small Influence")
points(x = point_2[1], y = point_2[2], pch = 1, cex = 4, col = "black", lwd = 2)
abline(ex_model, col = "dodgerblue", lwd = 2)
abline(model_2, lty = 2, col = "darkorange", lwd = 2)
legend("bottomleft", c("Original Data", "Added Point"),
lty = c(1, 2), col = c("dodgerblue", "darkorange"))
# high leverage, large residual, large influence
point_3 = c(14, 5.1)
ex_data_3 = rbind(ex_data, point_3)
model_3 = lm(y ~ x, data = ex_data_3)
plot(y ~ x, data = ex_data_3, cex = 2, pch = 20, col = "grey", ylim = c(-3, 12),
main = "High Leverage, Large Residual, Large Influence")
points(x = point_3[1], y = point_3[2], pch = 1, cex = 4, col = "black", lwd = 2)
abline(ex_model, col = "dodgerblue", lwd = 2)
abline(model_3, lty = 2, col = "darkorange", lwd = 2)
legend("bottomleft", c("Original Data", "Added Point"),
lty = c(1, 2), col = c("dodgerblue", "darkorange"))
```
The blue solid line in each plot is a regression fit to the 10 original data points stored in `ex_data`. The dashed orange line in each plot is the result of adding a single point to the original data in `ex_data`. This additional point is indicated by the circled point.
The slope of the regression for the original ten points, the solid blue line, is given by:
```{r}
coef(ex_model)[2]
```
The added point in the first plot has a *small* effect on the slope, which becomes:
```{r}
coef(model_1)[2]
```
We will say that this point has low leverage, is an outlier due to its large residual, but has small influence.
The added point in the second plot also has a *small* effect on the slope, which is:
```{r}
coef(model_2)[2]
```
We will say that this point has high leverage, is not an outlier due to its small residual, and has a very small influence.
Lastly, the added point in the third plot has a *large* effect on the slope, which is now:
```{r}
coef(model_3)[2]
```
This added point is influential. It both has high leverage, and is an outlier due to its large residual.
We've now mentioned three new concepts: leverage, outliers, and influential points, each of which we will discuss in detail.
### Leverage
A data point with high **leverage** is a data point that *could* have a large influence when fitting the model.
Recall that,
\[
\hat{\beta} = \left(X^\top X \right)^{-1} X^\top y.
\]
Thus,
\[
\hat{y} = X \hat{\beta} = X \left(X^\top X \right)^{-1} X^\top y
\]
Now we define,
\[
H = X \left(X^\top X\right)^{-1} X^\top
\]
which we will refer to as the *hat matrix*. The hat matrix is used to project onto the subspace spanned by the columns of $X$. It is also simply known as a projection matrix.
The hat matrix is a matrix that takes the original $y$ values, and adds a hat!
\[
\hat{y} = H y
\]
The diagonal elements of this matrix are called the **leverages**
\[
H_{ii} = h_i,
\]
where $h_i$ is the leverage for the $i$th observation.
Large values of $h_i$ indicate extreme values in $X$, which may influence regression. Note that leverages only depend on $X$.
Here, $p$, the number of $\beta$s, is also the trace (and rank) of the hat matrix.
\[
\sum_{i = 1}^n h_i = p
\]
What is a value of $h_i$ that would be considered large? There is no exact answer to this question. A common heuristic is to compare each leverage to two times the average leverage. An observation whose leverage is larger than this is one to be aware of. That is, if
\[
h_i > 2 \bar{h}
\]
we say that observation $i$ has large leverage. Here,
\[
\bar{h} = \frac{\sum_{i = 1}^n h_i}{n} = \frac{p}{n}.
\]
For simple linear regression, the leverage for each point is given by
\[
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}.
\]
This expression should be familiar. (Think back to inference for SLR.) It suggests that large leverages occur when $x$ values are far from their mean. Recall that the regression line goes through the point $(\bar{x},\bar{y})$.
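As a quick check of this expression, we can compute the leverages for `fit_1` by hand and compare them to the `hatvalues()` function introduced just below.
```{r}
# verify the SLR leverage formula on the data used for fit_1
x_1   = sim_data_1$x
h_slr = 1 / length(x_1) + (x_1 - mean(x_1)) ^ 2 / sum((x_1 - mean(x_1)) ^ 2)
all.equal(h_slr, unname(hatvalues(fit_1)))
```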
There are multiple ways to find leverages in `R`.
```{r}
lev_ex = data.frame(
x1 = c(0, 11, 11, 7, 4, 10, 5, 8),
x2 = c(1, 5, 4, 3, 1, 4, 4, 2),
y = c(11, 15, 13, 14, 0, 19, 16, 8))
plot(x2 ~ x1, data = lev_ex, cex = 2)
points(7, 3, pch = 20, col = "red", cex = 2)
```
Here we've created some multivariate data. Notice that we have plotted the $x$ values, not the $y$ values. The red point is $(7, 3)$, the mean of `x1` and the mean of `x2`, respectively.
We could calculate the leverages using the expressions defined above. We first create the $X$ matrix, then calculate $H$ as defined, and extract the diagonal elements.
```{r}
X = cbind(rep(1, 8), lev_ex$x1, lev_ex$x2)
H = X %*% solve(t(X) %*% X) %*% t(X)
diag(H)
```
Notice that here we have two predictors, so the regression has $p = 3$ $\beta$ parameters, and thus the sum of the diagonal elements is 3.
```{r}
sum(diag(H))
```
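Since $H$ projects $y$ onto the fitted values, we can also verify that $Hy$ reproduces the fitted values from `lm()`. (A quick sanity check.)
```{r}
# H y reproduces the fitted values from the corresponding regression
all.equal(as.vector(H %*% lev_ex$y),
          as.vector(fitted(lm(y ~ x1 + x2, data = lev_ex))))
```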
Alternatively, the method we will use more often is to simply fit a regression, then use the `hatvalues()` function, which returns the leverages.
```{r}
lev_fit = lm(y ~ ., data = lev_ex)
hatvalues(lev_fit)
```
Again, note that here we have "used" the $y$ values to fit the regression, but `R` still ignores them when calculating the leverages, as leverages only depend on the $x$ values.
```{r}
coef(lev_fit)
```
Let's see what happens to these coefficients when we modify the `y` value of the point with the highest leverage.
```{r}
which.max(hatvalues(lev_fit))
lev_ex[which.max(hatvalues(lev_fit)),]
```
We see that the original `y` value is `r lev_ex[which.max(hatvalues(lev_fit)),3]`. We'll create a copy of the data, and modify this point to have a `y` value of `20`.
```{r}
lev_ex_1 = lev_ex
lev_ex_1$y[1] = 20
lm(y ~ ., data = lev_ex_1)
```
Notice the **large** changes in the coefficients. Also notice that each of the coefficients has changed in some way. Note that the leverages of the points would not have changed, as we have not modified any of the $x$ values.
Now let's see what happens to these coefficients when we modify the `y` value of the point with the lowest leverage.
```{r}
which.min(hatvalues(lev_fit))
lev_ex[which.min(hatvalues(lev_fit)),]
```
We see that the original `y` value is `r lev_ex[which.min(hatvalues(lev_fit)),3]`. We'll again create a copy of the data, and modify this point to have a `y` value of `30`.
```{r}
lev_ex_2 = lev_ex
lev_ex_2$y[4] = 30
lm(y ~ ., data = lev_ex_2)
```
This time, despite a large change in the `y` value, there is only a small change in the coefficients. Also, only the intercept has changed!
```{r}
mean(lev_ex$x1)
mean(lev_ex$x2)
lev_ex[4,]
```
Notice that this point is at the mean of both of the predictors.
Returning to our three plots, each with an added point, we can calculate the leverages for each. Note that the 11th data point each time is the added data point.
```{r}
hatvalues(model_1)
hatvalues(model_2)
hatvalues(model_3)
```
Are any of these large?
```{r}
hatvalues(model_1) > 2 * mean(hatvalues(model_1))
hatvalues(model_2) > 2 * mean(hatvalues(model_2))
hatvalues(model_3) > 2 * mean(hatvalues(model_3))
```
We see that in the second and third plots, the added point is a point of high leverage. Recall that only in the third plot did the added point have an influence on the regression. To understand why, we'll need to discuss outliers.
### Outliers
Outliers are points which do not fit the model well. They may or may not have a large effect on the model. To identify outliers, we will look for observations with large residuals.
Note,
\[
e = y - \hat{y} = Iy - Hy = (I - H) y
\]
Then, under the assumptions of linear regression,
\[
\text{Var}(e_i) = (1 - h_i) \sigma^2
\]
and thus estimating $\sigma^2$ with $s_e^2$ gives
\[
\text{SE}[e_i] = s_e \sqrt{(1 - h_i)}.
\]
We can then look at the **standardized residual** for each observation, $i = 1, 2, \ldots, n$,
\[
r_i = \frac{e_i}{s_e\sqrt{1 - h_i}} \overset{approx}{\sim} N(\mu = 0, \sigma^2 = 1)
\]
when $n$ is large.
We can use this fact to identify "large" residuals. For example, standardized residuals greater than 2 in magnitude should only happen approximately 5 percent of the time.
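To make this definition concrete, we can compute the standardized residuals for `fit_1` by hand and check that they match `rstandard()`, the built-in function introduced just below.
```{r}
# standardized residuals by hand: e_i / (s_e * sqrt(1 - h_i))
r_manual = resid(fit_1) / (summary(fit_1)$sigma * sqrt(1 - hatvalues(fit_1)))
all.equal(r_manual, rstandard(fit_1))
```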
Returning again to our three plots, each with an added point, we can calculate the residuals and standardized residuals for each. Standardized residuals can be obtained in `R` by using `rstandard()` where we would normally use `resid()`.
```{r}
resid(model_1)
rstandard(model_1)
rstandard(model_1)[abs(rstandard(model_1)) > 2]
```
In the first plot, we see that the 11th point, the added point, has a large standardized residual.
```{r}
resid(model_2)
rstandard(model_2)
rstandard(model_2)[abs(rstandard(model_2)) > 2]
```
In the second plot, we see that there are no points with large standardized residuals.
```{r}
resid(model_3)
rstandard(model_3)
rstandard(model_3)[abs(rstandard(model_3)) > 2]
```
In the last plot, we see that the 11th point, the added point, has a large standardized residual.
Recall that the added points in plots two and three both had high leverage, but only the point in plot three has a large residual. We will now combine this information and discuss influence.
### Influence
As we have now seen in the three plots, some outliers only change the regression a small amount (plot one) and some outliers have a large effect on the regression (plot three). Observations that fall into the latter category, points with (some combination of) *high leverage* **and** *large residual*, we will call **influential**.
A common measure of influence is **Cook's Distance**, which is defined as
\[
D_i = \frac{1}{p}r_i^2\frac{h_i}{1-{h_i}}.
\]
Notice that this is a function of both *leverage* and *standardized residuals*.
A Cook's Distance is often considered large if
\[
D_i > \frac{4}{n}
\]
and an observation with a large Cook's Distance is called influential. This is again simply a heuristic, and not an exact rule.
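Since Cook's distance is built from quantities we have already seen, we can also compute it by hand and compare the result to `cooks.distance()`, the built-in function used below. (A quick check with `model_3`.)
```{r}
# Cook's distance by hand: (1 / p) * r_i^2 * h_i / (1 - h_i)
p_3 = length(coef(model_3))
d_manual = (1 / p_3) * rstandard(model_3) ^ 2 *
  hatvalues(model_3) / (1 - hatvalues(model_3))
all.equal(d_manual, cooks.distance(model_3))
```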
The Cook's distance for each point of a regression can be calculated using `cooks.distance()`, which is available in `R` by default. Let's look for influential points in the three plots we have been considering.
```{r, ref.label="unusual_obs_plot", fig.height = 5, fig.width = 15, echo = FALSE}
```
Recall that the circled points in each plot have different characteristics:
- Plot One: low leverage, large residual.
- Plot Two: high leverage, small residual.
- Plot Three: high leverage, large residual.
We'll now directly check if each of these is influential.
```{r}
cooks.distance(model_1)[11] > 4 / length(cooks.distance(model_1))
cooks.distance(model_2)[11] > 4 / length(cooks.distance(model_2))
cooks.distance(model_3)[11] > 4 / length(cooks.distance(model_3))
```
And, as expected, the added point in the third plot, with high leverage and a large residual, is considered influential!
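As a final check on this example, we can refit the model from the third plot without the influential point; the slope should move back toward the slope fit to the original ten points. (A quick sketch, using the same `subset` idea that appears again later in this chapter.)
```{r}
# refit without the influential point; compare to the slope of the original data
model_3_fix = lm(y ~ x, data = ex_data_3,
                 subset = cooks.distance(model_3) <= 4 / length(cooks.distance(model_3)))
coef(model_3_fix)[2]
coef(ex_model)[2]
```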
## Data Analysis Examples
### Good Diagnostics
Last chapter we fit an additive regression to the `mtcars` data with `mpg` as the response and `hp` and `am` as predictors. Let's perform some diagnostics on this model.
First, fit the model as we did last chapter.
```{r}
mpg_hp_add = lm(mpg ~ hp + am, data = mtcars)
```
```{r}
plot(fitted(mpg_hp_add), resid(mpg_hp_add), col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residual",
main = "mtcars: Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)
```
The fitted versus residuals plot looks good. We don't see any obvious pattern, and the variance looks roughly constant. (Maybe a little larger for large fitted values, but not enough to worry about.)
```{r}
bptest(mpg_hp_add)
```
The Breusch-Pagan test verifies this, at least for a small $\alpha$ value.
```{r}
qqnorm(resid(mpg_hp_add), col = "darkgrey")
qqline(resid(mpg_hp_add), col = "dodgerblue", lwd = 2)
```
The Q-Q plot looks extremely good and the Shapiro-Wilk test agrees.
```{r}
shapiro.test(resid(mpg_hp_add))
```
```{r}
sum(hatvalues(mpg_hp_add) > 2 * mean(hatvalues(mpg_hp_add)))
```
We see that there are two points of large leverage.
```{r}
sum(abs(rstandard(mpg_hp_add)) > 2)
```
There is also one point with a large residual. Do these result in any points that are considered influential?
```{r}
cd_mpg_hp_add = cooks.distance(mpg_hp_add)
sum(cd_mpg_hp_add > 4 / length(cd_mpg_hp_add))
large_cd_mpg = cd_mpg_hp_add > 4 / length(cd_mpg_hp_add)
cd_mpg_hp_add[large_cd_mpg]
```
We find two influential points. Interestingly, they are **very** different cars.
```{r}
coef(mpg_hp_add)
```
Since the diagnostics looked good, there isn't much need to worry about these two points, but let's see how much the coefficients change if we remove them.
```{r}
mpg_hp_add_fix = lm(mpg ~ hp + am,
data = mtcars,
subset = cd_mpg_hp_add <= 4 / length(cd_mpg_hp_add))
coef(mpg_hp_add_fix)
```
It seems there isn't much of a change in the coefficients as a result of removing the supposed influential points. Notice we did not create a new dataset to accomplish this. We instead used the `subset` argument to `lm()`. Think about what the code `cd_mpg_hp_add <= 4 / length(cd_mpg_hp_add)` does here.
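If it helps, the logical vector passed to `subset` simply flags which observations to keep; `lm()` keeps the rows where it is `TRUE`.
```{r}
# the subset argument keeps the observations where this vector is TRUE
table(cd_mpg_hp_add <= 4 / length(cd_mpg_hp_add))
```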
```{r, fig.height = 8, fig.width = 8}
par(mfrow = c(2, 2))
plot(mpg_hp_add)
```
Notice that calling `plot()` on a variable which stores an object created by `lm()` outputs four diagnostic plots by default. Use `?plot.lm` to learn more. The first two should already be familiar.
### Suspect Diagnostics
Let's consider the model `big_model` from last chapter which was fit to the `autompg` dataset. It used `mpg` as the response, and considered many interaction terms between the predictors `disp`, `hp`, and `domestic`.
```{r, echo = FALSE}
# read data frame from the web
autompg = read.table(
"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE)
# give the dataframe headers
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
# remove missing data, which is stored as "?"
autompg = subset(autompg, autompg$hp != "?")
# remove the plymouth reliant, as it causes some issues
autompg = subset(autompg, autompg$name != "plymouth reliant")
# give the dataset row names, based on the engine, year and name
rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
# remove the variable for name
autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin"))
# change horsepower from character to numeric
autompg$hp = as.numeric(autompg$hp)
# create a dummy variable for foreign vs. domestic cars. domestic = 1.
autompg$domestic = as.numeric(autompg$origin == 1)
# remove 3 and 5 cylinder cars (which are very rare.)
autompg = autompg[autompg$cyl != 5,]
autompg = autompg[autompg$cyl != 3,]
# the following line would verify the remaining cylinder possibilities are 4, 6, 8
#unique(autompg$cyl)
# change cyl to a factor variable
autompg$cyl = as.factor(autompg$cyl)
```
```{r}
str(autompg)
```
```{r}
big_model = lm(mpg ~ disp * hp * domestic, data = autompg)
```
```{r}
qqnorm(resid(big_model), col = "darkgrey")
qqline(resid(big_model), col = "dodgerblue", lwd = 2)
shapiro.test(resid(big_model))
```
Here both the Q-Q plot, and the Shapiro-Wilk test suggest that the normality assumption is violated.
```{r}
big_mod_cd = cooks.distance(big_model)
sum(big_mod_cd > 4 / length(big_mod_cd))
```
Here, we find `r sum(big_mod_cd > 4 / length(big_mod_cd))` influential points, so perhaps removing them will help!
```{r}
big_model_fix = lm(mpg ~ disp * hp * domestic,
data = autompg,
subset = big_mod_cd < 4 / length(big_mod_cd))
qqnorm(resid(big_model_fix), col = "grey")
qqline(resid(big_model_fix), col = "dodgerblue", lwd = 2)
shapiro.test(resid(big_model_fix))
```
Removing these points results in a much better Q-Q plot, and now Shapiro-Wilk fails to reject for a low $\alpha$.
We've now seen that sometimes modifying the data can fix issues with regression. However, next chapter, instead of modifying the data, we will modify the model via *transformations*.
## `R` Markdown
The `R` Markdown file for this chapter can be found here:
- [`diagnostics.Rmd`](diagnostics.Rmd){target="_blank"}
The file was created using `R` version ``r paste0(version$major, "." ,version$minor)``.