# Linear regression models {#c-regression}
## Introduction {#s-regression-intro}
This chapter continues the theme of analysing statistical associations
between variables. The methods described here are appropriate when the
response variable $Y$ is a continuous, interval level variable. We will
begin by considering bivariate situations where the only explanatory
variable $X$ is also a continuous variable. Section
\@ref(s-regression-descr) first discusses graphical and numerical
descriptive techniques for this case, focusing on two very commonly used
tools: a *scatterplot* of two variables, and a measure of association
known as the *correlation* coefficient. Section
\@ref(s-regression-simple) then describes methods of statistical
inference for associations between two continuous variables. This is
done in the context of a statistical model known as the *simple linear
regression model*.
The ideas of simple linear regression modelling can be extended to a
much more general and powerful set of methods known as *multiple linear
regression models*. These can have several explanatory variables, which
makes it possible to examine associations between any explanatory
variable and the response variable, while controlling for other
explanatory variables. An important reason for the usefulness of these
models is that they play a key role in statistical analyses which
correspond to research questions that are causal in nature. As an
interlude, we discuss issues of causality in research design and
analysis briefly in Section \@ref(s-regression-causality). Multiple
linear models are then introduced in Section
\@ref(s-regression-multiple). The models can also include categorical
explanatory variables with any number of categories, as explained in
Section \@ref(s-regression-dummies).
The following example will be used for illustration throughout this
chapter:
**Example 8.1: Indicators of Global Civil Society**
The *Global Civil Society 2004/5* yearbook gives tables of a range of
characteristics of the countries of the world.^[Anheier, H., Glasius, M. and Kaldor, M. (eds.) (2005). *Global
Civil Society 2004/5*. London: Sage. The book gives detailed
references to the indices considered here. Many thanks to Sally
Stares for providing the data in an electronic form.] The following
measures will be considered in this chapter:
- Gross Domestic Product (**GDP**) per capita in 2001 (in current
international dollars, adjusted for purchasing power parity)
- **Income level** of the country in three groups used by the
Yearbook, as Low income, Middle income or High income
- **Income inequality** measured by the Gini index (with 0
representing perfect equality and 100 perfect inequality)
- A measure of **political rights and civil liberties** in 2004,
obtained as the average of two indices for these characteristics
produced by the Freedom House organisation (1 to 7, with higher
values indicating more rights and liberties)
- World Bank Institute’s measure of control of **corruption** for 2002
(with high values indicating low levels of corruption)
- Net **primary school enrolment** ratio 2000-01 (%)
- **Infant mortality rate** 2001 (% of live births)
We will discuss various associations between these variables. It should
be noted that the analyses are mainly illustrative examples, and the
choices of explanatory and response variables do not imply any strong
claims about causal connections between them. Also, the fact that
different measures refer to slightly different years is ignored; in
effect, we treat each variable as a measure of the “recent” situation in the
countries. The full data set used here includes 165 countries. Many of
the variables are not available for all of them, so most of the analyses
below use a smaller number of countries.
## Describing association between two continuous variables {#s-regression-descr}
### Introduction {#ss-regression-descr-intro}
Suppose for now that we are considering data on two continuous
variables. The descriptive techniques discussed in this section do not
strictly speaking require a distinction between an explanatory variable
and a response variable, but it is nevertheless useful in many if not
most applications. We will reflect this in the notation by denoting the
variables $X$ (for the explanatory variable) and $Y$ (for the response
variable). The observed data consist of the pairs of observations
$(X_{1}, Y_{1}), (X_{2}, Y_{2}), \dots, (X_{n}, Y_{n})$ of $X$ and $Y$
for each of the $n$ subjects in a sample, or, with more concise
notation, $(X_{i}, Y_{i})$ for $i=1,2,\dots,n$.
We are interested in analysing the association between $X$ and $Y$.
Methods for *describing* this association in the sample are first
described in this section, initially with some standard graphical
methods in Section \@ref(ss-regression-descr-plots). This leads to a
discussion in Section \@ref(ss-regression-descr-assoc) of what we
actually mean by associations in this context, and then to a definition of
numerical summary measures for such associations in Section
\@ref(ss-regression-descr-corr). Statistical *inference* for the
associations will be considered in Section \@ref(s-regression-simple).
### Graphical methods {#ss-regression-descr-plots}
#### Scatterplots {-}
The standard statistical graphic for summarising the association between
two continuous variables is a **scatterplot**. An example of it is given
in Figure \@ref(fig:f-corruption1), which shows a scatterplot of Control of
corruption against GDP per capita for 61 countries for which the
corruption variable is at least 60 (the motivation of this restriction
will be discussed later). The two axes of the plot show possible values
of the two variables. The horizontal axis, here corresponding to Control
of corruption, is conventionally used for the explanatory variable $X$,
and is often referred to as the **X-axis**. The vertical axis, here used
for GDP per capita, then corresponds to the response variable $Y$, and
is known as the **Y-axis**.
![(\#fig:f-corruption1)A scatterplot of Control of corruption vs. GDP per capita in the Global Civil Society data set, for 61 countries with Control of corruption at least 60. The dotted lines are drawn to the point corresponding to the United Kingdom.](corruption1){width="13.5cm"}
The observed data are shown as points in the scatterplot, one for each
of the $n$ units. The location of each point is determined by its values
of $X$ and $Y$. For example, Figure \@ref(fig:f-corruption1) highlights the
observation for the United Kingdom, for which the corruption measure
($X$) is 94.3 and GDP per capita ($Y$) is \$24160. The point for the UK is
thus placed at the intersection of a vertical line drawn from 94.3 on
the $X$-axis and a horizontal line from 24160 on the $Y$-axis, as shown
in the plot.
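As a minimal sketch of how such a plot could be produced in R, the chunk below draws the scatterplot and the dotted reference lines for the UK. The data frame `gcs` and the variable names `corruption` and `gdp` are assumed here purely for illustration; they are not part of the distributed data file.

```{r, eval=FALSE}
# Sketch only: 'gcs', 'corruption' and 'gdp' are assumed (hypothetical) names.
plot(gcs$corruption, gcs$gdp,
     xlab = "Control of corruption",
     ylab = "GDP per capita ($)")
# Dotted lines drawn to the point for the United Kingdom (X = 94.3, Y = 24160):
segments(x0 = 94.3, y0 = par("usr")[3], x1 = 94.3, y1 = 24160, lty = "dotted")
segments(x0 = par("usr")[1], y0 = 24160, x1 = 94.3, y1 = 24160, lty = "dotted")
```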
The principles of good graphical presentation on clear labelling,
avoidance of spurious decoration and so on (c.f. Section
\@ref(s-descr1-presentation)) are the same for scatterplots as for any
statistical graphics. Because the crucial visual information in a
scatterplot is the shape of the cloud of the points, it is now often not
necessary for the scales of the axes to begin at zero, especially if
this is well outside the ranges of the observed values of the variables
(as it is for the $X$-axis of Figure \@ref(fig:f-corruption1)). Instead, the
scales are typically selected so that the points cover most of the
plotting surface. This is done by statistical software, but there are
many situations where it is advisable to overrule the automatic selection
(e.g. for making scatterplots of the same variables in two different
samples directly comparable).
The main purpose of a scatterplot is to examine possible associations
between $X$ and $Y$. Loosely speaking, this means considering the shape
and orientation of the cloud of points in the graph. In Figure
\@ref(fig:f-corruption1), for example, it seems that most of the points are
in a cluster sloping from lower left to upper right. This indicates that
countries with low levels of Control of corruption (i.e. high levels of
corruption itself) tend to have low GDP per capita, and those with
little corruption tend to have high levels of GDP. A more careful
discussion of such associations again relates them to the formal
definition in terms of conditional distributions, and also provides a
basis for the methods of inference introduced later in this chapter. We
will resume the discussion of these issues in Section
\@ref(ss-regression-descr-assoc) below. Before that, however, we will
digress briefly from the main thrust of this chapter in order to
describe a slightly different kind of scatterplot.
#### Line plots for time series {-}
A very common special case of a scatterplot is one where the
observations correspond to measurements of a variable for the same unit
at several occasions over time. This is illustrated by the following
example (another one is Figure \@ref(fig:f-houseprices)):
*Example: Changes in temperature, 1903–2004*
Figure \@ref(fig:f-temperatures) summarises data on average annual
temperatures over the past century in five locations. The data were
obtained from the GISS Surface Temperature (GISTEMP) database maintained
by the NASA Goddard Institute for Space Studies.^[Accessible at `data.giss.nasa.gov/gistemp/`. The temperatures used
here are those listed in the data base under “after combining
sources at same location”.] The database
contains time series of average monthly surface temperatures from
several hundred meteorological stations across the world. The five sites
considered here are Haparanda in Northern Sweden, Independence, Kansas
in the USA, Choshi on the east coast of Japan, Kimberley in South
Africa, and the Base Orcadas Station on Laurie Island, off the coast of
Antarctica. These were chosen rather haphazardly for this illustration,
with the aim of obtaining a geographically scattered set of rural or
small urban locations (to avoid issues with the heating effects of large
urban areas). The temperature for each year at each location is here
recorded as the difference from the temperature at that location in
1903.^[More specifically, the differences are between 11-year *moving
averages*, where each year is represented by the average of the
temperature for that year and the five years before and five after
it (except at the ends of the series, where fewer observations are
used). This is done to smooth out short-term fluctuations from the
data, so that longer-term trends become more clearly visible.]
![(\#fig:f-temperatures)Changes of average annual temperature (11-year moving averages) from 1903 in five locations. See the text for further details. Source: The GISTEMP database <data.giss.nasa.gov/gistemp/>](temperplot){width="13cm"}
Consider first the data for Haparanda only. Here we have two variables,
year and temperature, and 102 pairs of observations of them, one for
each year between 1903 and 2004. These pairs could now be plotted in a
scatterplot as described above. Here, however, we can go further to
enhance the visual effect of the plot. This is because the observations
represent measurements of a variable (temperature difference) for the
same unit (the town of Haparanda) at several successive times (years).
These 102 measurements form a *time series* of temperature differences
for Haparanda over 1903–2004. A standard graphical trick for such series
is to connect the points for successive times by lines, making it easy
for the eye to follow the changes over time in the variable on the
$Y$-axis. In Figure \@ref(fig:f-temperatures) this is done for Haparanda
using a solid line. Note that doing this would make no sense for scatter
plots like the one in Figure \@ref(fig:f-corruption1), because all the points
there represent different subjects, in that case countries.
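The following R chunk is a sketch of this trick for the Haparanda series. The vectors `haparanda` and `choshi` of temperature differences are assumed (hypothetical) names used only for illustration.

```{r, eval=FALSE}
# Sketch only: 'haparanda' and 'choshi' are assumed vectors of temperature
# differences from 1903, one value per year from 1903 to 2004.
years <- 1903:2004
plot(years, haparanda, type = "l",   # type = "l" joins successive points with a line
     xlab = "Year",
     ylab = "Temperature difference from 1903")
# Further locations can be added to the same plot with different line styles:
lines(years, choshi, lty = "dashed")
```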
We can easily include several such series in the same graph. In Figure
\@ref(fig:f-temperatures) this is done by plotting the temperature
differences for each of the five locations using different line styles.
The graph now summarises data on three variables, year, temperature and
location. We can then examine changes over time for any one location,
but also compare patterns of changes between them. Here there is clearly
much variation within and between locations, but also some common
features. Most importantly, the temperatures have all increased over the
past century. In all five locations the average annual temperatures at
the end of the period were around 1–2$^{\circ}$C higher than in 1903.
A set of time series like this is an example of dependent data in the
sense discussed in Section \@ref(s-means-dependent). There we considered
cases with pairs of observations, where the two observations in each
pair had to be treated as statistically dependent. Here all of the
temperature measurements for one location are dependent, probably with
strongest dependence between adjacent years and less dependence between
ones further apart. This means that we will not be able to analyse these
data with the methods described later in this chapter, because these
assume statistically independent observations. Methods of statistical
modelling and inference for dependent data of the kind illustrated by
the temperature example are beyond the scope of this course. This,
however, does not prevent us from using a plot like Figure
\@ref(fig:f-temperatures) to *describe* such data.
### Linear associations {#ss-regression-descr-assoc}
Consider again statistically independent observations of $(X_{i},
Y_{i})$, such as those displayed in Figure \@ref(fig:f-corruption1). Recall
the definition that two variables are associated if the conditional
distribution of $Y$ given $X$ is different for different values of $X$.
In the two-sample examples of Chapter \@ref(c-means) this could be
examined by comparing two conditional distributions, since $X$ had only
two possible values. Now, however, $X$ has many (in principle,
infinitely many) possible values, so we will need to somehow define and
compare conditional distributions given each of them. We will begin with
a rather informal discussion of how this might be done. This will lead
directly to a more precise and formal definition introduced in Section
\@ref(s-regression-simple).
![(\#fig:f-corruption2)The same scatterplot of Control of corruption vs. GDP per capita as in Figure \@ref(fig:f-corruption1), augmented by the best-fitting (least squares) straight line (solid line) and reference lines for two example values of Control of corruption (dotted lines).](corruption2){width="13.5cm"}
Figure \@ref(fig:f-corruption2) shows the same scatterplot as Figure
\@ref(fig:f-corruption1). Consider first one value of $X$ (Control of
corruption), say 65. To get a rough idea of the conditional distribution
of $Y$ (GDP per capita) given this value of $X$, we could examine the
sample distribution of the values of $Y$ for the units for which the
value of $X$ is close to 65. These correspond to the points near the
vertical line drawn at $X=65$ in Figure \@ref(fig:f-corruption2). This can be
repeated for any value of $X$; for example, Figure \@ref(fig:f-corruption2)
also includes a vertical reference line at $X=95$, for examining the
conditional distribution of $Y$ given $X=95$.^[This discussion is obviously rather approximate. Strictly
speaking, the conditional distribution of $Y$ given, say, $X=65$
refers only to units with $X$ exactly rather than approximately
equal to 65. This, however, is difficult to illustrate using a
sample, because most values of a continuous $X$ appear at most once
in a sample. For reasons discussed later in this chapter, the
present approximate treatment still provides a reasonable general
idea of the nature of the kinds of associations considered here.]
As in Chapter \@ref(c-means), associations between variables will here be
considered almost solely in terms of differences in the *means* of the
conditional distributions of $Y$ at different values of $X$. For
example, Figure \@ref(fig:f-corruption2) suggests that the conditional mean
of $Y$ when $X$ is 65 is around or just under 10000. At $X=95$, on the
other hand, the conditional mean seems to be between 20000 and 25000.
The mean of $Y$ is thus higher at the larger value of $X$. More generally,
this finding is consistent across the scatterplot, in that the
conditional mean of $Y$ appears to increase when we consider
increasingly large values of $X$, indicating that higher levels of
Control of corruption are associated with higher average levels of GDP.
This is often expressed by saying that the conditional mean of $Y$
increases when we “increase” $X$.^[This wording is commonly used for convenience even in cases where
the nature of $X$ is such that its values can never actually be
manipulated.] This is the sense in which we will
examine associations between continuous variables: does the conditional
mean of $Y$ change (increase or decrease) when we increase $X$? If it
does, the two variables are associated; if it does not, there is no
association of this kind. This definition also agrees with the one
linking association with prediction: if the mean of $Y$ is different for
different values of $X$, knowing the value of $X$ will clearly help us
in making predictions about likely values of $Y$. Based on the
information in Figure \@ref(fig:f-corruption2), for example, our best guesses
of the GDPs of two countries would clearly be different if we were told
that the control of corruption measure was 65 for one country and 95 for
the other.
The *nature* of the association between $X$ and $Y$ is characterised by
*how* the values of $Y$ change when $X$ increases. First, it is almost
always reasonable to conceive these changes as reasonably smooth and
gradual. In other words, if two values of $X$ are close to each other,
the conditional means of $Y$ will be similar too; for example, if the
mean of $Y$ is 5 when $X=10$, its mean when $X=10.01$ is likely to be
quite close to 5 rather than, say, 405. In technical terms, this means
that the conditional mean of $Y$ will be described by a smooth
mathematical function of $X$. Graphically, the means of $Y$ as $X$
increases will then trace a smooth curve in the scatterplot. The
simplest possibility for such a curve is a straight line. This
possibility is illustrated by plot (a) of Figure \@ref(fig:f-scatterplots)
(this and the other five plots in the figure display artificial data,
generated for this illustration). Here all of the points fall on a line,
so that when $X$ increases, the values of $Y$ increase at a constant
rate. A relationship like this is known as a **linear association**
between $X$ and $Y$. Linear associations are the starting point for
examining associations between continuous variables, and often the only
ones considered. In this chapter we too will focus almost completely on
them.
![(\#fig:f-scatterplots)Scatterplots of artificial data sets of two variables. Each plot also shows the best-fitting (least squares) straight line and the correlation coefficient $r$.](scatterplots){width="13.5cm"}
In plot (a) of Figure \@ref(fig:f-scatterplots) all the points are exactly on
the straight line. This indicates a *perfect* linear association, where
$Y$ can be predicted exactly if $X$ is known, so that the association is
*deterministic*. Such a situation is neither realistic in practice, nor
necessary for the association to be described as linear. All that is
required for the latter is that the conditional *means* of $Y$ given
different values of $X$ fall (approximately) on a straight line. This is
illustrated by plot (b) of Figure \@ref(fig:f-scatterplots), which shows a
scatterplot of individual observations together with an approximation of
the line of the means of $Y$ given $X$ (how the line was drawn will be
explained later). Here the linear association is not perfect, as the
individual points are not all on the same line but scattered around it.
Nevertheless, the line seems to capture an important systematic feature
of the data, which is that the *average* values of $Y$ increase at an
approximately constant rate as $X$ increases. This combination of
systematic and random elements is characteristic of all statistical
associations, and it is also central to the formal setting for
statistical inference for linear associations described in Section
\@ref(s-regression-simple) below.
The **direction** of a linear association can be either **positive** or
**negative**. Plots (a) and (b) of Figure \@ref(fig:f-scatterplots) show a
positive association, because increasing $X$ is associated with
increasing average values of $Y$. This is indicated by the upward slope
of the line describing the association. Plot (c) shows an example of a
negative association, where the line slopes downwards and increasing
values of $X$ are associated with decreasing values of $Y$. The third
possibility, illustrated by plot (d), is that the line slopes neither up
nor down, so that the mean of $Y$ is the same for all values of $X$. In
this case there is no (linear) association between the variables.
Not all associations between continuous variables are linear, as shown
by the remaining two plots of Figure \@ref(fig:f-scatterplots). These
illustrate two kinds of **nonlinear** associations. In plot (e), the
association is still clearly *monotonic*, meaning that average values of
$Y$ change in the same direction — here increase — when $X$ increases.
The rate of this increase, however, is not constant, as indicated by the
slightly curved shape of the cloud of points. The values of $Y$ seem to
increase faster for small values of $X$ than for large ones. A straight
line drawn through the scatterplot captures the general direction of the
increase, but misses its nonlinearity. One practical example of such a
relationship is the one between years of job experience and salary: it
is often found that salary increases fastest early on in a person’s
career and more slowly later on.
Plot (f) shows a nonlinear and nonmonotonic relationship: as $X$
increases, average values of $Y$ first decrease to a minimum, and then
increase again, resulting in a U-shaped scatterplot. A straight line is
clearly an entirely inadequate description of such a relationship. A
nonmonotonic association of this kind might be seen, for example, when
considering the dependence of the failure rates of some electrical
components ($Y$) on their age ($X$). It might then be that the failure
rates were high early (from quick failures of flawed components) and
late on (from inevitable wear and tear) and lowest in between for
“middle-aged but healthy” components.
![(\#fig:f-corruption3)A scatterplot of Control of corruption vs. GDP per capita for 163 countries in the Global Civil Society data set. The solid line is the best-fitting (least squares) straight line for the points.](corruption3){width="13.5cm"}
Returning to real data, recall that we have so far considered control of
corruption and GDP per capita only among countries with a Control of
corruption score of at least 60. The scatterplot for these, shown in
Figure \@ref(fig:f-corruption2), also includes a best-fitting straight line.
The observed relationship is clearly positive, and seems to be fairly
well described by a straight line. For countries with relatively low
levels of corruption, the association between control of corruption and
GDP can be reasonably well characterised as linear.
Consider now the set of all countries, including also those with high
levels of corruption (scores of less than 60). In a scatterplot for
them, shown in Figure \@ref(fig:f-corruption3), the points with at least 60
on the $X$-axis are the same as those in Figure \@ref(fig:f-corruption2), and
the new points are to the left of them. The plot now shows a nonlinear
relationship comparable to the one in plot (e) of Figure
\@ref(fig:f-scatterplots). The linear relationship which was a good
description for the countries considered above is thus not adequate for
the full set of countries. Instead, it seems that the association is
much weaker for the countries with high levels of corruption,
essentially all of which have fairly low values of GDP per capita. The
straight line fitted to the plot identifies the overall positive
association, but cannot describe its nonlinearity. This example further
illustrates how scatterplots can be used to examine relationships
between variables and to assess whether they can be best described as
linear or nonlinear associations.^[In this particular example, a more closely linear association is
obtained by considering the logarithm of GDP as the response
variable instead of GDP itself. This approach, which is common in
dealing with skewed variables such as income, is, however, beyond
the scope of this course.]
So far we have said nothing about how the exact location and direction
of the straight lines shown in the figures have been selected. These are
determined so that the fitted line is in a certain sense the best
possible one for describing the data in the scatterplot. Because the
calculations needed for this are also (and more importantly) used in the
context of statistical inference for such data, we will postpone a
description of them until Section \@ref(ss-regression-simple-est). For
now we can treat the line simply as a visual summary of the linear
association in a scatterplot.
### Measures of association: covariance and correlation {#ss-regression-descr-corr}
A scatterplot is a very powerful tool for examining sample associations
of pairs of variables in detail. Sometimes, however, this is more than
we really need for an initial summary of a data set, especially if there
are many variables and thus many possible pairs of them. It is then
convenient also to be able to summarise each pairwise association using
a single-number measure of association. This section introduces the
correlation coefficient, the most common such measure for continuous
variables. It is a measure of the strength of *linear* associations of
the kind defined above.
Suppose that we consider two variables, denoted $X$ and $Y$. This again
implies a distinction between an explanatory and a response variable, to
maintain continuity of notation between different parts of this chapter.
The correlation coefficient itself, however, is completely symmetric, so
that its value for a pair of variables will be the same whether or not
we treat one or the other of them as explanatory for the other. First,
recall from the definition of the standard deviation towards the end of Section \@ref(ss-descr1-nums-variation) that the sample standard deviations of
the two variables are calculated as
\begin{equation}
s_{x} = \sqrt{\frac{\sum(X_{i}-\bar{X})^{2}}{n-1}} \quad\text{and}\quad s_{y} = \sqrt{\frac{\sum (Y_{i}-\bar{Y})^{2}}{n-1}}
(\#eq:sdyx)
\end{equation}
where the subscripts $x$ and $y$ identify the two
variables, and $\bar{X}$ and $\bar{Y}$ are their sample means. A new
statistic is the **sample covariance** between $X$ and $Y$, defined as
\begin{equation}
s_{xy} = \frac{\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{n-1}.
(\#eq:sxy)
\end{equation}
This is a measure of linear association between $X$ and
$Y$. It is positive if the sample association is positive and negative
if the association is negative.
In theoretical statistics, covariance is the fundamental summary of
sample and population associations between two continuous variables. For
descriptive purposes, however, it has the inconvenient feature that its
magnitude depends on the units in which $X$ and $Y$ are measured. This
makes it difficult to judge whether a value of the covariance for
particular variables should be regarded as large or small. To remove
this complication, we can standardise the sample covariance by dividing
it by the standard deviations, to obtain the statistic
\begin{equation}
r=\frac{s_{xy}}{s_{x}s_{y}} = \frac{\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum\left(X_{i}-\bar{X}\right)^{2} \sum\left(Y_{i}-\bar{Y}\right)^{2}}}.
(\#eq:corr)
\end{equation}
This is the (sample) **correlation** coefficient, or
correlation for short, between $X$ and $Y$. It is also often (e.g. in
SPSS) known as *Pearson’s* correlation coefficient after Karl Pearson
(of the $\chi^{2}$ test, see first footnote in Chapter \@ref(c-tables)), although both
the word and the statistic are really due to Sir Francis Galton.^[Galton, F. (1888). “Co-relations and their measurement, chiefly
from anthropometric data”. *Proceedings of the Royal Society of
London*, **45**, 135–145.]
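As a quick check of formulas \@ref(eq:sxy) and \@ref(eq:corr), the following R chunk computes the covariance and correlation “by hand” for two small artificial vectors and compares the results with R’s built-in `cov()` and `cor()` functions.

```{r}
# Sample covariance and correlation computed from the formulas above,
# using two small artificial vectors, and compared with cov() and cor().
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 6, 7)
n <- length(x)
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance
r   <- sxy / (sd(x) * sd(y))                         # correlation coefficient
c(sxy, cov(x, y))   # both give the same covariance
c(r, cor(x, y))     # both give the same correlation
```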
The properties of the correlation coefficient can be described by going
through the same list as for the $\gamma$ coefficient in Section
\@ref(ss-descr1-2cat-gamma). While doing so, it is useful to refer to the
examples in Figure \@ref(fig:f-scatterplots), where the correlations are also
shown.
- **Sign**: Correlation is positive if the *linear* association
between the variables is positive, i.e. if the best-fitting straight
line slopes upwards (as in plots a, b and e) and negative if the
association is negative (c). A zero correlation indicates complete
lack of linear association (d and f).
- **Extreme values**: The largest possible correlation is $+1$
(plot a) and the smallest $-1$, indicating perfect positive and
negative linear associations respectively. More generally, the
magnitude of the correlation indicates the strength of the
association, so that the closer to $+1$ or $-1$ the correlation is,
the stronger the association (e.g. compare plots a–d). It should
again be noted that the correlation captures only the linear aspect
of the association, as illustrated by the two nonlinear cases in
Figure \@ref(fig:f-scatterplots). In plot (e), there is curvature but
also a strong positive trend, and the latter is reflected in a
fairly high correlation. In plot (f), the trend is absent and the
correlation is 0, even though there is an obvious
nonlinear relationship. Thus the correlation coefficient is a
reasonable initial summary of the strength of association in (e),
but completely misleading in (f).
- **Formal interpretation**: The correlation coefficient cannot be
interpreted as a Proportional Reduction in Error (PRE) measure, but
its square can. The latter statistic, so-called coefficient of
determination or $R^{2}$, is described in Section
\@ref(ss-regression-simple-int).
- **Substantive interpretation**: As with any measure of association,
the question of whether a particular sample correlation is high or
low is not a purely statistical question, but depends on the nature
of the variables. This can be judged properly only with the help of
experience of correlations between similar variables in
different contexts. As one very rough rule of thumb it might be said
that in many social science contexts correlations greater than 0.4
(or smaller than $-0.4$) would typically be considered noteworthy
and ones greater than 0.7 quite strong.
Returning to real data, Table \@ref(tab:t-civilsoc-r) shows the correlation
coefficients for all fifteen distinct pairs of the six continuous
variables in the Global Civil Society data set mentioned in Example 8.1. This is an example of a **correlation matrix**,
which is simply a table with the variables as both its rows and columns,
and the correlation between each pair of variables given at the
intersection of corresponding row and column. For example, the
correlation of GDP per capita and School enrolment is here 0.42. This is
shown at the intersection of the first row (GDP) and fifth column
(School enrolment), and also of the fifth row and first column. In
general, every correlation is shown twice in the matrix, once in its
upper triangle and once in the lower. The triangles are separated by a
list of ones on the diagonal of the matrix. This simply indicates that
the correlation of any variable with itself is 1, which is true by
definition and thus of no real interest.
| Variable                            | GDP   | Gini  | Pol.  | Corrupt. | School | IMR   |
|-------------------------------------|-------|-------|-------|----------|--------|-------|
| GDP per capita \[GDP\]              | 1     | -0.39 | 0.51  | 0.77     | 0.42   | -0.62 |
| Income inequality \[Gini\]          | -0.39 | 1     | -0.15 | -0.27    | -0.27  | 0.42  |
| Political rights \[Pol.\]           | 0.51  | -0.15 | 1     | 0.59     | 0.40   | -0.44 |
| Control of corruption \[Corrupt.\]  | 0.77  | -0.27 | 0.59  | 1        | 0.41   | -0.64 |
| School enrolment \[School\]         | 0.42  | -0.27 | 0.40  | 0.41     | 1      | -0.73 |
| Infant mortality \[IMR\]            | -0.62 | 0.42  | -0.44 | -0.64    | -0.73  | 1     |
: (\#tab:t-civilsoc-r)Correlation matrix of six continuous variables in the Global Civil
Society data set. See Example 8.1 for more information
on the variables.
All of the observed associations in this example are in unsurprising
directions. For example, School enrolment is positively correlated with
GDP, Political rights and Control of corruption, and negatively
correlated with Income inequality and Infant mortality. In other words,
countries with large percentages of children enrolled in primary school
tend to have high levels of GDP per capita and of political rights and
civil liberties, and low levels of corruption, income inequality and
infant mortality. The strongest associations in these data are between
GDP per capita and Control of corruption ($r=0.77$) and School enrolment
and Infant mortality rate ($r=-0.73$), and the weakest between Income
inequality on the one hand and Political rights, Control of corruption
and School enrolment on the other (correlations of $-0.15$, $-0.27$ and
$-0.27$ respectively).
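A correlation matrix like Table \@ref(tab:t-civilsoc-r) can be produced with a single command in most statistical software. The R sketch below assumes a data frame `gcs` whose columns hold the six variables (the column names here are hypothetical); the option `use = "pairwise.complete.obs"` corresponds to calculating each correlation from all countries with non-missing values of that pair of variables.

```{r, eval=FALSE}
# Sketch only: 'gcs' and its column names are assumed (hypothetical).
vars <- gcs[, c("gdp", "gini", "polrights", "corruption", "school", "imr")]
# Each correlation uses the countries with non-missing values for that pair:
round(cor(vars, use = "pairwise.complete.obs"), 2)
```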
These correlations describe only the linear element of sample
associations, but give no hint of any nonlinear ones. For example, the
correlation of 0.77 between GDP and Control of corruption summarises the
way the observations cluster around the straight line shown in Figure
\@ref(fig:f-corruption3). The correlation is high because this increase in
GDP as Control of corruption increases is quite strong, but it gives no
indication of the nonlinearity of the association. A scatterplot is
needed for revealing this feature of the data. The correlation for the
restricted set of countries shown in Figure \@ref(fig:f-corruption2) is 0.82.
A correlation coefficient can also be defined for the joint population
distribution of two variables. The sample correlation $r$ can then be
treated as an estimate of the population correlation, which is often
denoted by $\rho$ (the lower-case Greek “rho”). Statistical inference
for the population correlation can also be derived. For example, SPSS
automatically outputs significance tests for the null hypothesis that
$\rho$ is 0, i.e. that there is no linear association between $X$ and
$Y$ in the population. Here, however, we will not discuss this, choosing
to treat $r$ purely as a descriptive sample statistic. The next section
provides a different set of tools for inference on population
associations.
## Simple linear regression models {#s-regression-simple}
### Introduction {#ss-regression-simple-intro}
The rest of this course is devoted to the method of linear regression
modelling. Its purpose is the analysis of associations in cases where
the response variable is a continuous, interval level variable, and the
possibly several explanatory variables can be of any type. We begin in
this section with *simple* linear regression, where there is only one
explanatory variable. We will further assume that this is also
continuous. The situation considered here is thus the same as in the
previous section, but here the focus will be on statistical inference
rather than description. Most of the main concepts of linear regression
can be introduced in this context. Those that go beyond it are described
in subsequent sections. Section \@ref(s-regression-multiple) introduces
*multiple* regression involving more than one explanatory variable. The
use of categorical explanatory variables in such models is explained in
Section \@ref(s-regression-dummies). Finally, Section
\@ref(s-regression-rest) gives a brief review of some further aspects of
linear regression modelling which are not covered on this course.
*Example: Predictors of Infant Mortality Rate*
The concepts of linear regression models will be illustrated as they are
introduced with a second example from the Global Civil Society data set.
The response variable will now be Infant Mortality Rate (IMR). This is
an illuminating outcome variable, because it is a sensitive and
unquestionably important reflection of a country’s wellbeing; whatever
we mean by “development”, it is difficult to disagree that high levels
of it should coincide with low levels of infant mortality. We will
initially consider only one explanatory variable, Net primary school
enrolment ratio, referred to as “School enrolment” for short. This is
defined as the percentage of all children of primary school age who are
enrolled in school. Enrolment numbers and the population size are often
obtained from different official sources, which sometimes leads to
discrepancies. In particular, School enrolment for several countries is
recorded as over 100, which is logically impossible. This is an
illustration of the kinds of measurement errors often affecting
variables in the social sciences. We will use the School enrolment
values as recorded, even though they are known to contain some error.
A scatterplot of IMR vs. School enrolment is shown in Figure
\@ref(fig:f-imr1), together with the best-fitting straight line. Later we
will also consider three additional explanatory variables: Control of
corruption, Income inequality and Income level of the country in three
categories (c.f. Example 8.1). For further reference,
Table \@ref(tab:t-imrvars) shows various summary statistics for these
variables. Throughout, the analyses are restricted to those 111
countries for which all of the five variables are recorded. For this
reason the correlations in Table \@ref(tab:t-imrvars) differ slightly from
those in Table \@ref(tab:t-civilsoc-r), where each correlation was calculated
for all the countries with non-missing values of that pair of variables.
![(\#fig:f-imr1)A scatterplot of net primary school enrolment ratio vs. Infant mortality rate for countries in the Global Civil Society data set ($n=111$). The solid line is the best-fitting (least squares) straight line for the points.](imr1){width="13.5cm"}
|                                                      | IMR   | School enrolment | Control of corruption | Income inequality |
|------------------------------------------------------|-------|------------------|-----------------------|-------------------|
| *Summary statistics*                                 |       |                  |                       |                   |
| Mean                                                 | 4.3   | 86.1             | 50.1                  | 40.5              |
| std. deviation                                       | 4.0   | 16.7             | 28.4                  | 10.2              |
| Minimum                                              | 0.3   | 30.0             | 3.6                   | 24.4              |
| Maximum                                              | 15.6  | 109.0            | 100.0                 | 70.7              |
| *Correlation matrix*                                 |       |                  |                       |                   |
| IMR                                                  | 1     | -0.75            | -0.60                 | 0.39              |
| School enrolment                                     | -0.75 | 1                | 0.39                  | -0.27             |
| Control of corruption                                | -0.60 | 0.39             | 1                     | -0.27             |
| Income inequality                                    | 0.39  | -0.27            | -0.27                 | 1                 |
| *Means for countries in different income categories* |       |                  |                       |                   |
| Low income ($n=41$)                                  | 8.2   | 72.1             | 27.5                  | 41.7              |
| Middle income ($n=48$)                               | 2.8   | 92.5             | 50.8                  | 43.3              |
| High income ($n=22$)                                 | 0.5   | 98.4             | 90.7                  | 32.0              |
: (\#tab:t-imrvars)Summary statistics for Infant Mortality Rate (IMR) and explanatory
variables for it considered in the examples of Sections
\@ref(s-regression-simple) and \@ref(s-regression-multiple) ($n=111$).
See Example 8.1 for further information on the
variables.
### Definition of the model {#ss-regression-simple-def}
The simple linear regression model defined in this section is a
statistical model for a continuous, interval level response variable $Y$
given a single explanatory variable $X$, such as IMR given School
enrolment. The model will be used to carry out statistical inference on
the association between the variables in a population (which in the IMR
example is clearly again of the conceptual variety).
For motivation, recall first the situation considered in Section
\@ref(s-means-inference). There the data consisted of observations
$(Y_{i},
X_{i})$ for $i=1,2,\dots,n$, which were assumed to be statistically
independent. The response variable $Y$ was continuous but $X$ had only
two possible values, coded 1 and 2. A model was then set up where the
population distribution of $Y$ had mean $\mu_{1}$ and variance
$\sigma^{2}_{1}$ for units with $X=1$, and mean $\mu_{2}$ and variance
$\sigma^{2}_{2}$ when $X=2$. In some cases it was further assumed that
the population distributions were both normal, and that the population
variances were equal, i.e. that $\sigma^{2}_{1}=\sigma^{2}_{2}$, with
their common value denoted $\sigma^{2}$. With these further assumptions,
which will also be used here, the model for $Y$ given a dichotomous $X$
stated that (1) observations for different units $i$ were statistically
independent; (2) each $Y_{i}$ was sampled at random from a population
distribution which was normal with mean $\mu_{i}$ and variance
$\sigma^{2}$; and (3) $\mu_{i}$ depended on $X_{i}$ so that it was equal
to $\mu_{1}$ if $X_{i}$ was 1 and $\mu_{2}$ if $X_{i}$ was 2.
The situation in this section is exactly the same, except that $X$ is
now continuous instead of dichotomous. We will use the same basic model,
but will change the specification of the conditional mean $\mu_{i}$
appropriately. In the light of the discussion in previous sections of
this chapter, it is no surprise that this will be defined in such a way
that it describes a linear association between $X$ and $Y$. This is done
by setting $\mu_{i}=\alpha+\beta X_{i}$, where $\alpha$ and $\beta$ are
unknown population parameters. This is the equation of a straight line (we
will return to it in the next section). With this specification, the
model for observations
$(Y_{1},X_{1}), (Y_{2}, X_{2}), \dots, (Y_{n}, X_{n})$ becomes
1. Observations for different units $i$ ($=1,2,\dots,n$) are
statistically independent.
2. Each $Y_{i}$ is normally distributed with mean $\mu_{i}$ and
variance $\sigma^{2}$.
3. The means $\mu_{i}$ depend on $X_{i}$ through $\mu_{i}=\alpha+\beta
X_{i}$.
Often the model is expressed in an equivalent form where 2. and 3. are
combined as
\begin{equation}
Y_{i}=\alpha+\beta X_{i} +\epsilon_{i}
(\#eq:slinmodel)
\end{equation}
where each $\epsilon_{i}$ is normally distributed
with mean 0 and variance $\sigma^{2}$. The $\epsilon_{i}$ are known as
**error terms** or **population residuals** (and the letter $\epsilon$
is the lower-case Greek “epsilon”). This formulation of the model
clearly separates the mean of $Y_{i}$, which traces the straight line
$\alpha+\beta X_{i}$ as $X_{i}$ changes, from the variation around that
line, which is described by the variability of $\epsilon_{i}$.
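One way to make formulation \@ref(eq:slinmodel) concrete is to simulate artificial data from it. The following R chunk, with arbitrarily chosen parameter values, generates observations whose means lie on the line $\alpha+\beta X$ and which are scattered around it by the normally distributed error terms.

```{r, eval=FALSE}
# Simulating data from the simple linear regression model
# Y_i = alpha + beta * X_i + epsilon_i, with arbitrary parameter values.
set.seed(1)
n     <- 100
alpha <- 2; beta <- 0.5; sigma <- 1
x   <- runif(n, 0, 10)                  # values of the explanatory variable
eps <- rnorm(n, mean = 0, sd = sigma)   # error terms, N(0, sigma^2)
y   <- alpha + beta * x + eps           # responses: line plus random variation
plot(x, y)
abline(a = alpha, b = beta)             # the population regression line
```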
The model defined above is known as the **simple linear regression
model**:
- **Simple** because it has only one explanatory variable, as opposed
to *multiple* linear regression models which will have more
than one.
- **Linear** because it specifies a linear association between $X$ and
$Y$.^[This is slightly misleading: what actually matters in general is
that the conditional mean is a linear function of the *parameters*
$\alpha$ and $\beta$. This need not concern us at this stage.]
- **Regression**: This is now an established part of the name of the
model, although the origins of the word are not central to the use
of the model.^[Galton, F. (1886). “Regression towards mediocrity in hereditary
stature”. *Journal of the Anthropological Institute*, **15**,
246–263. The original context is essentially the one discussed on
courses on research design as “regression toward the mean”.]
- **Model**, because this is a statistical model in the sense
discussed in the middle of Section \@ref(ss-contd-probdistrs-general). In other words, the model is
always only a simplified abstraction of the true, immeasurably
complex processes which determine the values of $Y$. Nevertheless,
it is believed that a well-chosen model can be useful for explaining
and predicting observed values of $Y$. This spirit is captured by
the well-known statement by the statistician George Box:^[This exact phrase apparently first appears in Box, G.E.P. (1979).
Robustness in the strategy of scientific model building. In Launer,
R.L. and Wilkinson, G.N., *Robustness in Statistics*, pp. 201–236.]
> *All models are wrong, but some are useful.*
A model like this has the advantage that it reduces the examination
of associations in the population to estimation and inference on a
small number of model parameters, in the case of the simple linear
regression model just $\alpha$, $\beta$ and $\sigma^{2}$.
Of course, not all models are equally appropriate for given data, and
some will be both wrong and useless. The results from a model should
thus be seriously presented and interpreted only if the model is deemed
to be reasonably adequate. For the simple linear regression model, this
can be partly done by examining whether the scatterplot between $X$ and
$Y$ appears to be reasonably consistent with a linear relationship. Some
further comments on the assessment of model adequacy will be given in
Section \@ref(s-regression-rest).
### Interpretation of the model parameters {#ss-regression-simple-int}
The simple linear regression model (\@ref(eq:slinmodel)) has three
parameters, $\alpha$, $\beta$ and $\sigma^{2}$. Each of these has its
own interpretation, which are explained in this section. Sometimes it
will be useful to illustrate the definition with specific numerical
values, for which we will use ones for the model for IMR given School
enrolment in our example. SPSS output for this model is shown in Figure
\@ref(fig:f-spss-linreg). Note that although these values are first used here
to illustrate the interpretation of the *population* parameters in the
model, they are of course only estimates (of a kind explained in the
next section) of those parameters. Other parts of the SPSS output will
be explained later in this chapter.
![(\#fig:f-spss-linreg)SPSS output for a simple linear regression model for Infant mortality rate given School enrolment in the Global Civil Society data.](spsslinreg){width="15.5cm"}
According to the model, the conditional mean (also often known as the
conditional **expected value**) of $Y$ given $X$ in the population is
(dropping the subscript $i$ for now for notational simplicity)
$\mu=\alpha+\beta X$. The two parameters $\alpha$ and $\beta$ in this
formula are known as **regression coefficients**. They are interpreted
as follows:
- $\alpha$ is the expected value of $Y$ when $X$ is equal to 0. It is
known as the **intercept** or **constant** term of the model.
- $\beta$ is the change in the expected value of $Y$ when $X$
increases by 1 unit. It is known as the **slope** term or the
**coefficient of** $X$.
Just to include one mathematical proof in this coursepack, these results
can be derived as follows:
- When $X=0$, the mean of $Y$ is
$\mu=\alpha+\beta X=\alpha+\beta\times 0
=\alpha+0=\alpha$.
- Compare two observations, one with value $X$ of the explanatory
variable, and the other with one unit more, i.e. $X+1$. The
corresponding means of $Y$ are
\begin{align*}
\text{with } X+1:\quad \mu &= \alpha+\beta\times (X+1) = \alpha+\beta X +\beta\\
\text{with } X:\quad \mu &= \alpha+\beta X\\
\text{Difference:} & \quad\beta
\end{align*}
which completes the proof of the claims above — Q.E.D. In case you
prefer a graphical summary, this is given in Figure
\@ref(fig:f-linmod-params).
![(\#fig:f-linmod-params)Illustration of the interpretation of the regression coefficients of a simple linear regression model.](lmparams){width="12.5cm"}
The most important parameter of the model, and usually the only one
really discussed in interpreting the results, is $\beta$, the regression
coefficient of $X$. It is also called the slope because it is literally
the slope of the regression line, as shown in Figure
\@ref(fig:f-linmod-params). It is the only parameter in the model which
describes the association between $X$ and $Y$, and it does so in the
above terms of expected changes in $Y$ corresponding to changes in $X$
($\beta$ is also related to the correlation between $X$ and $Y$, in a
way explained in the next section). The sign of $\beta$ indicates the
direction of the association. When $\beta$ is positive (greater than 0),
the regression line slopes upwards and increasing $X$ thus also
increases the expected value of $Y$ — in other words, the association
between $X$ and $Y$ is positive. This is the case illustrated in Figure
\@ref(fig:f-linmod-params). If $\beta$ is negative, the regression line
slopes downwards and the association is also negative. Finally, if
$\beta$ is zero, the line is parallel with the $X$-axis, so that
changing $X$ does not change the expected value of $Y$. Thus $\beta=0$
corresponds to no (linear) association between $X$ and $Y$.
In the real example shown in Figure \@ref(fig:f-spss-linreg), $X$ is School
enrolment and $Y$ is IMR. In SPSS output, the estimated regression
coefficients are given in the “**Coefficients**” table in the column
labelled “B” under “Unstandardized coefficients”. The estimated constant
term $\alpha$ is given in the row labelled “(Constant)”, and the slope
term on the next row, labelled with the name or label of the explanatory
variable as specified in the SPSS data file — here “Net primary school
enrolment ratio 2000-2001 (%)”. The value of the intercept is here
19.736 and the slope coefficient is $-0.179$. The estimated regression
line for expected IMR is thus $19.736-0.179 X$, where $X$ denotes School
enrolment. This is the line shown in Figure \@ref(fig:f-imr1).
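For readers working in R rather than SPSS, a sketch of fitting the same model is given below. The data frame and variable names `gcs`, `imr` and `school` are assumed for illustration; `coef()` returns the estimated intercept and slope, corresponding to the “B” column of the SPSS output.

```{r, eval=FALSE}
# Sketch only: 'gcs', 'imr' and 'school' are assumed (hypothetical) names.
fit <- lm(imr ~ school, data = gcs)
coef(fit)      # estimated intercept and slope (cf. the SPSS "B" column)
summary(fit)   # fuller output: coefficients, standard errors, R-squared, etc.
```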
Because the slope coefficient in the example is negative, the
association between the variables is also negative, i.e. higher levels
of school enrolment are associated with lower levels of infant
mortality. More specifically, every increase of one unit (here one
percentage point) in School enrolment is associated with a decrease of
0.179 units (here percentage points) in expected IMR.
Since the meaning of $\beta$ is related to a unit increase of the
explanatory variable, the interpretation of its magnitude depends on
what those units are. In many cases one unit of $X$ is too small or too
large for convenient interpretation. For example, a change of one
percentage point in School enrolment is rather small, given that the
range of this variable in our data is 79 percentage points (c.f. Table
\@ref(tab:t-imrvars)). In such cases the results can easily be reexpressed by
using multiples of $\beta$: specifically, the effect on expected value
of $Y$ of changing $X$ by $A$ units is obtained by multiplying $\beta$
by $A$. For instance, in our example the estimated effect of increasing
School enrolment by 10 percentage points is to decrease expected IMR by
$10\times 0.179=1.79$ percentage points.
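A small numerical check of this multiplication rule, using the estimated coefficients quoted above, is shown below: the predicted values of IMR at School enrolments of 80 and 90 differ by exactly $10\times(-0.179)$.

```{r}
# Effect of a 10-unit increase in X: difference between fitted values
# at enrolments of 90 and 80, using the estimated coefficients.
alpha_hat <- 19.736
beta_hat  <- -0.179
(alpha_hat + beta_hat * 90) - (alpha_hat + beta_hat * 80)   # equals 10 * beta_hat = -1.79
```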
The constant term $\alpha$ is a necessary part of the model, but it is
almost never of interest in itself. This is because the expected value
of $Y$ at $X=0$ is rarely specifically interesting. Very often $X=0$ is
also unrealistic, as in our example where it corresponds to a country
with zero primary school enrolment. There are fortunately no such
countries in the data, where the lowest School enrolment is 30%. It is
then of no interest to discuss expected IMR for a hypothetical country
where no children went to school. Doing so would also represent
unwarranted *extrapolation* of the model beyond the range of the
observed data. Even though the estimated linear model seems to fit
reasonably well for these data, this is no guarantee that it would do so
also for countries with much lower school enrolment, even if they
existed.
The third parameter of the simple regression model is $\sigma^{2}$. This
is the variance of the conditional distribution of $Y$ given $X$. It is
also known as the **conditional variance** of $Y$, the **error
variance** or the **residual variance**. Similarly, its square root
$\sigma$ is known as the conditional, error or **residual standard
deviation**. To understand $\sigma$, let us consider a single value of
$X$, such as one corresponding to one of the vertical dashed lines in
Figure \@ref(fig:f-linmod-params) or, say, school enrolment of 85 in Figure
\@ref(fig:f-imr1). The model specifies a distribution for $Y$ given any such
value of $X$. If we were to (hypothetically) collect a large number of
observations, all with this same value of $X$, the distribution of $Y$
for them would describe the conditional distribution of $Y$ given that
value of $X$. The model states that the average of these values,
i.e. the conditional mean of $Y$, is $\alpha+\beta X$, which is the
point on the regression line corresponding to $X$. The individual values
of $Y$, however, would of course not all be on the line but somewhere
around it, some above and some below.
The linear regression model further specifies that the form of the
conditional distribution of $Y$ is approximately normal. You can try to
visualise this by imagining a normal probability curve (c.f. Figure
\@ref(fig:f-norm1)) on the vertical line from $X$, centered on the regression
line and sticking up from the page. The bell shape of the curve
indicates that most of the values of $Y$ for a given $X$ will be close
to the regression line, and only small proportions of them far from it.
The residual standard deviation $\sigma$ is the standard deviation of
this conditional normal distribution, in essence describing how tightly
concentrated values of $Y$ tend to be around the regression line. The
model assumes, mainly for simplicity, that the same value of $\sigma$
applies to the conditional distributions at all values of $X$; this is
known as the assumption of *homoscedasticity*.
In SPSS output, an estimate of $\sigma$ is given in the “**Model
Summary**” table under the misleading label “Std. Error of the
Estimate”. An estimate of the residual variance $\sigma^{2}$ is found
also in the “**ANOVA**” table under “Mean Square” for “Residual”. In our
example the estimate of $\sigma$ is 2.6173 (and that of $\sigma^{2}$ is
6.85). This is usually not of direct interest for interpretation, but it
will be a necessary component of some parts of the analysis discussed
below.
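In R the corresponding estimate is labelled “Residual standard error” in the model summary. A sketch, assuming the (hypothetical) fitted model object `fit` from the earlier chunk:

```{r, eval=FALSE}
# Sketch only: 'fit' is the (assumed) fitted model object from above.
summary(fit)$sigma     # estimate of the residual standard deviation sigma
summary(fit)$sigma^2   # estimate of the residual variance sigma^2
```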
### Estimation of the parameters {#ss-regression-simple-est}
Since the regression coefficients $\alpha$ and $\beta$ and the residual
standard deviation $\sigma$ are unknown population parameters, we will
need to use the observed data to obtain sensible estimates for them. How
to do so is now less obvious than in the cases of simple means and
proportions considered before. This section explains the standard method
of estimation for the parameters of linear regression models.
We will denote estimates of $\alpha$ and $\beta$ by $\hat{\alpha}$ and
$\hat{\beta}$ (“alpha-hat” and “beta-hat”) respectively (other notations
are also often used, e.g. $a$ and $b$). Similarly, we can define
$$\hat{Y}=\hat{\alpha}+\hat{\beta} X$$ for $Y$ given any value of $X$.
These are the values on the estimated regression line. They are known as
**fitted values** for $Y$, and estimating the parameters of the
regression model is often referred to as “fitting the model” to the
observed data. The fitted values represent our predictions of expected
values of $Y$ given $X$, so they are also known as **predicted values**
of $Y$.
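In R, fitted values and residuals for all observations in the sample can be extracted directly from a fitted model object. A sketch, again assuming the (hypothetical) object `fit` from the earlier chunk:

```{r, eval=FALSE}
# Sketch only: 'fit' is the (assumed) fitted model object from above.
y_hat <- fitted(fit)      # fitted values  alpha-hat + beta-hat * X_i
res   <- residuals(fit)   # sample residuals  Y_i - Y-hat_i
head(cbind(y_hat, res))
```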
In particular, fitted values $\hat{Y}_{i}=\hat{\alpha}+\hat{\beta}X_{i}$
can be calculated at the values $X_{i}$ of the explanatory variable $X$
for each unit $i$ in the observed sample. These can then be compared to
the corresponding values $Y_{i}$ of the response variable. Their
differences $Y_{i}-\hat{Y}_{i}$ are known as the (sample) **residuals**.
These quantities are illustrated in Figure \@ref(fig:f-residuals). This shows
a fitted regression line, which is in fact the one for IMR given School
enrolment also shown in Figure \@ref(fig:f-imr1). Also shown are two points
$(X_{i}, Y_{i})$. These are also from Figure \@ref(fig:f-imr1); the rest have
been omitted to simplify the plot. The point further to the left is the
one for Mali, which has School enrolment $X_{i}=43.0$ and IMR
$Y_{i}=14.1$. Using the estimated coefficients $\hat{\alpha}=19.736$ and
$\hat{\beta}=-0.179$ in Figure \@ref(fig:f-spss-linreg), the fitted value for
Mali is $\hat{Y}_{i}=19.736-0.179\times 43.0=12.0$. Their difference is
the residual $Y_{i}-\hat{Y}_{i}=14.1-12.0=2.1$. Because the observed
value is here larger than the fitted value, the residual is positive and
the observed value is above the fitted line, as shown in Figure
\@ref(fig:f-residuals).
![(\#fig:f-residuals)Illustration of the quantities involved in the definitions of least squares estimates and the coefficient of determination $R^{2}$. See the text for explanation.](lmresids){width="13.5cm"}
<!--
$Y_{i}-\hat{Y}_{i}$
$Y_{i}-\hat{Y}_{i}$
$\bar{Y}$
$\hat{Y}_{i}-\bar{Y}$
$Y_{i}-\bar{Y}$
$\hat{Y}=\hat{\alpha}+\hat{\beta} X$
-->
The second point shown in Figure \@ref(fig:f-residuals) corresponds to the
observation for Ghana, for which $X_{i}=58.0$ and $Y_{i}=5.7$. The