# Analysis of population means {#c-means}
## Introduction and examples {#s-means-intro}
This chapter introduces some basic methods of analysis for continuous,
interval-level variables. The main focus is on statistical inference on
population *means* of such variables, but some new methods of
descriptive statistics are also described. The discussion draws on the
general ideas that have already been explained for inference in Chapters
\@ref(c-tables) and \@ref(c-probs), and for continuous distributions in
Chapter \@ref(c-contd). Few if any new concepts thus need to be
introduced here. Instead, this chapter can focus on describing the
specifics of these very commonly used methods for continuous variables.
As in Chapter \@ref(c-probs), questions on both a single group and on
comparisons between two groups are discussed here. Now, however, the
main focus is on the two-group case. There we treat the group as the
explanatory variable $X$ and the continuous variable of interest as the
response variable $Y$, and assess the possible associations between $X$
and $Y$ by comparing the distributions (and especially the means) of $Y$
in the two groups.
The following five examples will be used for illustration throughout
this chapter. Summary statistics for them are shown in Table
\@ref(tab:t-groupex).
**Example 7.1: Survey data on diet**
The National Diet and Nutrition Survey of adults aged 19–64 living in
private households in Great Britain was carried out in 2000–01.^[Conducted for the Food Standards Agency and the Department of
Health by ONS and MRC Human Nutrition Research. The sample
statistics used here are from the survey reports published by HMSO
in 2002-04, aggregating results published separately for men and
women. The standard errors have been adjusted for non-constant
sampling probabilities using design factors published in the survey
reports. We will treat these numbers as if they were from a simple
random sample.] One
part of the survey was a food diary where the respondents recorded all
food and drink they consumed in a seven-day period. We consider two
variables derived from the diary: the consumption of fruit and
vegetables in portions (of 400g) per day (with mean in the sample of
size $n=1724$ of $\bar{Y}=2.8$, and standard deviation $s=2.15$), and
the percentage of daily food energy intake obtained from fat and fatty
acids ($n=1724$, $\bar{Y}=35.3$, and $s=6.11$).
|                                                   |  $n$ | $\bar{Y}$ |  $s$ | Diff. |
|:--------------------------------------------------|-----:|----------:|-----:|------:|
| **One sample**                                    |      |           |      |       |
| *Example 7.1: Variables from the National Diet and Nutrition Survey* | | | | |
| Fruit and vegetable consumption (400g portions)   | 1724 |       2.8 | 2.15 |       |
| Total energy intake from fat (%)                  | 1724 |      35.3 | 6.11 |       |
| **Two independent samples**                       |      |           |      |       |
| *Example 7.2: Average weekly hours spent on housework* | | | | |
| Men                                               |  635 |      7.33 | 5.53 |       |
| Women                                             |  469 |      8.49 | 6.14 |  1.16 |
| *Example 7.3: Perceived friendliness of a police officer* | | | | |
| No sunglasses                                     |   67 |      8.23 | 2.39 |       |
| Sunglasses                                        |   66 |      6.49 | 2.01 | -1.74 |
| **Two dependent samples**                         |      |           |      |       |
| *Example 7.4: Father's personal well-being*       |      |           |      |       |
| Sixth month of wife’s pregnancy                   |  109 |     30.69 |      |       |
| One month after the birth                         |  109 |     30.77 | 2.58 |  0.08 |
| *Example 7.5: Traffic flows on successive Fridays* |     |           |      |       |
| Friday the 6th                                    |   10 |   128,385 |      |       |
| Friday the 13th                                   |   10 |   126,550 | 1176 | -1835 |

:(\#tab:t-groupex)Examples of analyses of population means used in Chapter \@ref(c-means). Here $n$ and $\bar{Y}$ denote the sample size and sample mean respectively, in the two-group examples 7.2–7.5 separately for the two groups. “Diff.” denotes the between-group difference of means, and $s$ is the sample standard deviation of the response variable $Y$ for the whole sample (Example 7.1), of the response variable within each group (Examples 7.2 and 7.3), or of the within-pair differences (Examples 7.4 and 7.5).
**Example 7.2: Housework by men and women**
This example uses data from the 12th wave of the British Household Panel
Survey (BHPS), collected in 2002. BHPS is an ongoing survey of UK
households, measuring a range of socioeconomic variables. One of the
questions in 2002 was
*“About how many hours do you spend on housework in an average week,
such as time spent cooking, cleaning and doing the laundry?”*
The response to this question (recorded in whole hours) will be the
response variable $Y$, and the respondent’s sex will be the explanatory
variable $X$. We consider only those respondents who were less than 65
years old at the time of the interview and who lived in single-person
households (thus the comparisons considered here will not involve
questions of the division of domestic work within families).^[The data were obtained from the UK Data Archive. Three respondents
with outlying values of the housework variable (two women and one
man, with 50, 50 and 70 reported weekly hours) have been omitted
from the analysis considered here.]
We can indicate summary statistics separately for the two groups by
using subscripts 1 for men and 2 for women (for example). The sample
sizes are $n_{1}=635$ for men and $n_{2}=469$ for women, and the sample
means of $Y$ are $\bar{Y}_{1}=7.33$ and $\bar{Y}_{2}=8.49$. These and
the sample standard deviations $s_{1}$ and $s_{2}$ are also shown in
Table \@ref(tab:t-groupex).
**Example 7.3: Eye contact and perceived friendliness of police officers**
This example is based on an experiment conducted to examine the effects
of some aspects of the appearance and behaviour of police officers on
how members of the public perceive their encounters with the police.^[Boyanowsky, E. O. and Griffiths, C. T. (1982). “Weapons and eye
contact as instigators or inhibitors of aggressive arousal in
police-citizen interaction”. *Journal of Applied Social Psychology*,
**12**, 398–407.]
The subjects of the study were 133 people stopped by the Traffic Patrol
Division of a detachment of the Royal Canadian Mounted Police. When
talking to the driver who had been stopped, the police officer either
wore reflective sunglasses which hid his eyes, or wore no glasses at
all, thus permitting eye contact with the respondent. These two
conditions define the explanatory variable $X$, coded 1 if the officer
wore no glasses and 2 if he wore sunglasses. The choice of whether
sunglasses were worn was made at random before a driver was stopped.
While the police officer went back to his car to write out a report, a
researcher asked the respondent some further questions, one of which is
used here as the response variable $Y$. It is a measure of the
respondent’s perception of the friendliness of the police officer,
measured on a 10-point scale where large values indicate high levels of
friendliness.
The article describing the experiment does not report all the summary
statistics needed for our purposes. The statistics shown in Table
\@ref(tab:t-groupex) have thus been partially made up for use here. They are,
however, consistent with the real results from the study. In particular,
the direction and statistical significance of the difference between
$\bar{Y}_{2}$ and $\bar{Y}_{1}$ are the same as those in the published
report.
**Example 7.4: Transition to parenthood**
In a study of the stresses and feelings associated with parenthood, 109
couples expecting their first child were interviewed before and after
the birth of the baby.^[Miller, B. C. and Sollie, D. L. (1980). “Normal stresses during
the transition to parenthood”. *Family Relations*, **29**, 459–465.
See the article for further information, including results for the
mothers.] Here we consider only data for the fathers,
and only one of the variables measured in the study. This variable is a
measure of personal well-being, obtained from a seven-item attitude
scale, where larger values indicate higher levels of well-being.
Measurements of it were obtained for each father at three time points:
when the mother was six months pregnant, one month after the birth of
the baby, and six months after the birth. Here we will use only the
first two of the measurements. The response variable $Y$ will thus be
the measure of personal well-being, and the explanatory variable $X$
will be the time of measurement (sixth month of the pregnancy or one
month after the birth). The means of $Y$ at the two times are shown in
Table \@ref(tab:t-groupex). As in Example 7.3, not all of the numbers needed
here were given in the original article. Specifically, the standard
error of the difference in Table \@ref(tab:t-groupex) has been made up in
such a way that the results of a significance test for the mean
difference agree with those in the article.
**Example 7.5: Traffic patterns on Friday the 13th**
A common superstition regards the 13th day of any month falling on a
Friday as a particularly unlucky day. In a study examining the possible
effects of this belief on people’s behaviour,^[Scanlon, T. J. et al. (1993). “Is Friday the 13th bad for your
health?”. *British Medical Journal*, **307**, 1584–1586. The data
were obtained from The Data and Story Library at Carnegie Mellon
University (`lib.stat.cmu.edu/DASL`).] data were obtained on
the numbers of vehicles travelling between junctions 7 and 8 and
junctions 9 and 10 on the M25 motorway around London during every Friday
the 13th in 1990–92. For comparison, the same numbers were also recorded
during the previous Friday (i.e. the 6th) in each case. There are only
ten such pairs here, and the full data set is shown in Table
\@ref(tab:t-F13). Here the explanatory variable $X$ indicates whether a day
is Friday the 6th (coded as 1) or Friday the 13th (coded as 2), and the
response variable is the number of vehicles travelling between two
junctions.
| Date           | Junctions | Friday the 6th | Friday the 13th | Difference |
|:---------------|:----------|---------------:|----------------:|-----------:|
| July 1990      | 7 to 8    | 139246         | 138548          | -698       |
| July 1990      | 9 to 10   | 134012         | 132908          | -1104      |
| September 1991 | 7 to 8    | 137055         | 136018          | -1037      |
| September 1991 | 9 to 10   | 133732         | 131843          | -1889      |
| December 1991  | 7 to 8    | 123552         | 121641          | -1911      |
| December 1991  | 9 to 10   | 121139         | 118723          | -2416      |
| March 1992     | 7 to 8    | 128293         | 125532          | -2761      |
| March 1992     | 9 to 10   | 124631         | 120249          | -4382      |
| November 1992  | 7 to 8    | 124609         | 122770          | -1839      |
| November 1992  | 9 to 10   | 117584         | 117263          | -321       |

:(\#tab:t-F13)Data for Example 7.5: Traffic flows between junctions of the M25 on
each Friday the 6th and Friday the 13th in 1990-92.
In each of these cases, we will regard the variable of interest $Y$ as a
continuous, interval-level variable. The five examples illustrate three
different situations considered in this chapter. Example 7.1 includes
two separate $Y$-variables (consumption of fruit and vegetables, and fat
intake), each of which is considered for a single population. Questions
of interest are about the mean of the variable in the population. This
is analogous to the one-group questions on proportions in Sections
\@ref(s-probs-test1sample) and \@ref(s-probs-1sampleci). In this chapter
the one-group case is discussed only relatively briefly, in Section
\@ref(s-means-1sample).
The main focus here is on the case illustrated by Examples 7.2 and 7.3.
These involve samples of a response variable (hours of housework, or
perceived friendliness) from two groups (men and women, or police with
or without sunglasses). We are then interested in comparing the
distributions, and especially the means, of the response variable
between the groups. This case will be discussed first. Descriptive
statistics for it are described in Section \@ref(s-means-descr), and
statistical inference in Section \@ref(s-means-inference).
Finally, Examples 7.4 and 7.5 also involve comparisons between two
groups, but of a slightly different kind from Examples 7.2 and 7.3. The
two types of cases differ in the nature of the two samples (groups)
being compared. \label{p-depsamples} In Examples 7.2 and 7.3, the
samples can be considered to be **independent**. What this claim means
will be discussed briefly later; informally, it is justified in these
examples because the subjects in the two groups are separate and
unrelated individuals. In Examples 7.4 and 7.5, in contrast, the samples
(before and after the birth of a child, or two successive Fridays) must
be considered **dependent**, essentially because they concern
measurements on the same units at two distinct times. This case is
discussed in Section \@ref(s-means-dependent).
In each of the four two-group examples we are primarily interested in
questions about possible association between the group variable $X$ and
the response variable $Y$. As before, this is the question of whether
the conditional distributions of $Y$ are different at the two levels of
$X$. There is thus an association between $X$ and $Y$ if
- Example 7.2: The distribution of hours of housework is different for
men than for women.
- Example 7.3: The distribution of perceptions of a police officer’s
friendliness is different when he is wearing mirrored sunglasses
than when he is not.
- Example 7.4: The distribution of measurements of personal well-being
is different at the sixth month of the pregnancy than one month
after the birth.
- Example 7.5: The distributions of the numbers of cars on the
motorway differ between Friday the 6th and the following Friday
the 13th.
We denote the two values of $X$, i.e. the two groups, by 1 and 2. The
mean of the population distribution of $Y$ given $X=1$ will be denoted
$\mu_{1}$ and the standard deviation $\sigma_{1}$, and the mean and
standard deviation of the population distribution given $X=2$ are
denoted $\mu_{2}$ and $\sigma_{2}$ similarly. The corresponding sample
quantities are the conditional sample means $\bar{Y}_{1}$ and
$\bar{Y}_{2}$ and sample standard deviations $s_{1}$ and $s_{2}$. For
inference, we will focus on the population difference
$\Delta=\mu_{2}-\mu_{1}$ which is estimated by the sample difference
$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}$. Some of the descriptive methods
described in Section \@ref(s-means-descr), on the other hand, also aim to
summarise and compare other aspects of the two conditional sample
distributions.
## Descriptive statistics for comparisons of groups {#s-means-descr}
### Graphical methods of comparing sample distributions {#ss-means-descr-graphs}
There is an association between the group variable $X$ and the response
variable $Y$ if the distributions of $Y$ in the two groups are not the
same. To determine the extent and nature of any such association, we
need to compare the two distributions. This section describes methods of
doing so for observed data, i.e. for examining associations in a sample.
We begin with graphical methods which can be used to detect differences
in any aspects of the two distributions. We then discuss some
non-graphical summaries which compare specific aspects of the sample
distributions, especially their means.
Although the methods of *inference* described later in this chapter will
be limited to the case where the group variable $X$ is dichotomous, many
of the descriptive methods discussed below can just as easily be applied
when more than two groups are being compared. This will be mentioned
wherever appropriate. For inference in the multiple-group case some of
the methods discussed in Chapter \@ref(c-regression) are applicable.
In Section \@ref(ss-descr1-1cont-graphs) we described four graphical
methods of summarizing the sample distribution of one continuous
variable $Y$: the histogram, the stem and leaf plot, the frequency
polygon and the box plot. Each of these can be adapted for comparisons
of two or more distributions, although some more conveniently than
others. We illustrate three of these plots for this purpose, using
the comparison of housework hours in Example 7.2. Stem
and leaf plots will not be shown, because they are less appropriate when
the sample sizes are as large as they are in this example.
Two sample distributions can be compared by displaying histograms of
them side by side, as shown in Figure \@ref(fig:f-hworkpyramid). This is not
a very common type of graph, and not ideal for visually comparing the
two distributions, because the bars to be compared (here for men
vs. women) end at opposite ends of the plot. A better alternative is to
use frequency polygons. Since these represent a sample distribution by a
single line, it is easy to include two of them in the same plot, as
shown in Figure \@ref(fig:f-hworkpolygons). Finally, Figure
\@ref(fig:f-twoboxplots) shows two boxplots of reported housework hours, one
for men and one for women.
The plots suggest that the distributions are quite similar for men and
women. In both groups, the largest proportion of respondents stated that
they do between 4 and 7 hours of housework a week. The distributions are
clearly positively skewed, since the reported number of hours was much
higher than average for a number of people (whereas less than zero hours
were of course not recorded for anyone). The proportions of observations
in categories including values 5, 10, 15, 20, 25 and 30 tend to be
relatively high, suggesting that many respondents chose to report their
answers in such round numbers. The box plots show that the median number
of hours is higher for women than for men (7 vs. 6 hours), and women’s
responses have slightly less variation, as measured by both the IQR and
the range of the whiskers. Both distributions have several larger,
outlying observations (note that SPSS, which was used to produce Figure
\@ref(fig:f-twoboxplots), divides outliers into moderate and “extreme” ones;
the latter are observations more than 3 IQR from the end of the box, and
are plotted with asterisks).
![(\#fig:f-hworkpyramid)Histograms of the sample distributions of reported weekly hours of housework in Example 7.2, separately for men ($n=635$) and women ($n=469$).](hworkpyramid){width="130mm"}
![(\#fig:f-hworkpolygons)Frequency polygons of the sample distributions of reported weekly hours of housework in Example 7.2, separately for men and women. The points show the percentages of observations in the intervals of 0–3, 4–7, $\dots$, 32–35 hours (plus zero percentages at each end of the curve).](hwork){width="11.5cm"}
![(\#fig:f-twoboxplots)Box plots of the sample distributions of reported weekly hours of housework in Example 7.2, separately for men and women.](twoboxplots){width="11cm"}
Figures \@ref(fig:f-hworkpyramid)–\@ref(fig:f-twoboxplots) also illustrate an
important general point about such comparisons. Typically we focus on
comparing *means* of the conditional distributions. Here the difference
between the sample means is 1.16, i.e. women in the sample spend, on
average, over an hour longer on housework per week than men. The
direction of the difference could also be guessed from Figure
\@ref(fig:f-hworkpolygons), which shows that somewhat smaller proportions of
women than of men report small numbers of hours, and larger proportions
of women report large numbers. This difference will later be shown to be
statistically significant, and it is also arguably relatively large in a
substantive sense.
However, it is equally important to note that the two distributions
summarized by the graphs are nevertheless largely similar. For example,
even though the mean is higher for women, there are clearly many women
who report spending hardly any time on housework, and many men who spend
a lot of time on it. In other words, the two distributions overlap to a
large extent. This obvious point is often somewhat neglected in public
discussions of differences between groups such as men and women or
different ethnic groups. It is not uncommon to see reports of research
indicating that (say) men have higher or lower values of something or
other than women. Such statements usually refer to differences of
averages, and are often clearly important and interesting. Less helpful,
however, is the tendency to discuss the differences almost as if the
corresponding distributions had no overlap at all, i.e. as if *all* men
were higher or lower in some characteristic than all women. This is
obviously hardly ever the case.
Box plots and frequency polygons can also be used to compare more than
two sample distributions. For example, the experimental conditions in
the study behind Example 7.3 actually involved not only whether or not a
police officer wore sunglasses, but also whether or not he wore a gun.
Distributions of perceived friendliness given all four combinations of
these two conditions could easily be summarized by drawing four box
plots or frequency polygons in the same plot, one for each experimental
condition.
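For readers who prefer to work in R rather than SPSS, plots like Figures \@ref(fig:f-hworkpyramid)–\@ref(fig:f-twoboxplots) can be produced with standard graphics functions. The sketch below assumes a hypothetical data frame `bhps` with variables `hours` (weekly housework hours) and `sex` (a factor with levels "Men" and "Women"); it illustrates the idea rather than reproducing the exact code behind the figures.

```r
## Side-by-side box plots, one per group:
boxplot(hours ~ sex, data = bhps, ylab = "Weekly hours of housework")

## Frequency polygons: percentages of observations in intervals of
## equal width, drawn as one line per group.
breaks <- seq(0, 36, by = 4)          # intervals 0-4, 4-8, ..., 32-36
mids   <- breaks[-1] - 2              # interval midpoints
pct <- function(y) 100 * hist(y, breaks = breaks, plot = FALSE)$counts / length(y)
plot(mids, pct(bhps$hours[bhps$sex == "Men"]), type = "b",
     xlab = "Weekly hours of housework", ylab = "% of respondents")
lines(mids, pct(bhps$hours[bhps$sex == "Women"]), type = "b", lty = 2)
legend("topright", legend = c("Men", "Women"), lty = 1:2)
```

Because `boxplot()` draws one box per level of the grouping factor, exactly the same call extends to comparisons of more than two groups.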
### Comparing summary statistics {#ss-means-descr-tables}
Main features of sample distributions, such as their central tendencies
and variations, are described using the summary statistics introduced in
Section \@ref(s-descr1-nums). These too can be compared between groups.
Table \@ref(tab:t-groupex) shows such statistics for the examples of this
chapter. Tables like these are routinely reported for initial
description of data, even if more elaborate statistical methods are
later used.
Sometimes the association between two variables in a sample is
summarized in a single *measure of association* calculated from the
data. This is especially convenient when both of the variables are
continuous (in which case the most common measure of association is
known as the *correlation* coefficient). In this section we consider as
such a summary the difference $\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}$ of
the sample means of $Y$ in the two groups. These differences are also
shown in Table \@ref(tab:t-groupex).
The difference of means is important because it is also the focus of the
most common methods of inference for two-group comparisons. For purely
descriptive purposes it may be as or more convenient to report some
other statistic. For example, the difference of means of 1.16 hours in
Example 7.2 could also be described in *relative* terms by saying that
the women’s average is about 16 per cent higher than the men’s average
(because $1.16/7.33=0.158$, i.e. the difference represents 15.8 % of the
men’s average).
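The same relative comparison can be computed directly in R:

```r
## Relative difference of the sample means in Example 7.2:
ybar_men <- 7.33; ybar_women <- 8.49
(ybar_women - ybar_men) / ybar_men    # 0.158, i.e. about 16 per cent
```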
## Inference for two means from independent samples {#s-means-inference}
### Aims of the analysis {#ss-means-inference-intro}
Formulated as a statistical model in the sense discussed in Section \@ref(ss-contd-probdistrs-general), the
assumptions of the analyses considered in this section are as follows:
1. \label{p-2sample} We have a sample of $n_{1}$ independent
observations of a variable $Y$ in group 1, which have a population
distribution with mean $\mu_{1}$ and standard deviation
$\sigma_{1}$.
2. We have a sample of $n_{2}$ independent observations of $Y$ in group
2, which have a population distribution with mean $\mu_{2}$ and
standard deviation $\sigma_{2}$.
3. The two samples are independent, in the sense discussed following Example 7.5.
4. For now, we further assume that the population standard deviations
$\sigma_{1}$ and $\sigma_{2}$ are equal, with a common value denoted
by $\sigma$. This relatively minor assumption will be discussed
further in Section \@ref(ss-means-inference-variants).
We could also have stated the starting points of the analyses in Chapters
\@ref(c-tables) and \@ref(c-probs) in such formal terms. It is not
absolutely necessary to always do so, but we should at least remember
that any statistical analysis is based on some such model. In
particular, this helps to make it clear what our methods of analysis do
and do not assume, so that we may critically examine whether these
assumptions appear to be justified for the data at hand.
The model stated above does not require that the population
distributions of $Y$ should have the form of any particular probability
distribution. It is often further assumed that these distributions are
normal distributions, but this is not essential. Discussion of this
question is postponed until Section \@ref(ss-means-inference-variants).
The only new term in this model statement was the “independent” under
assumptions 1 and 2. This statistical term can be roughly translated as
“unrelated”. The condition can usually be regarded as satisfied when the
units of analysis are different entities, as in Examples 7.2 and 7.3
where the units within each group are distinct individual people. In
these examples the individuals in the two groups are also distinct, from
which it follows that the two *samples* are independent as required by
assumption 3. The same assumption of independent observations is also
required by all of the methods described in Chapters \@ref(c-tables) and
\@ref(c-probs), although we did not state this explicitly there.
This situation is illustrated by Example 7.2, where $Y$ is the number of
hours a person spends doing housework in a week, and the two groups are
men (group 1) and women (group 2).
The quantity of main interest is here the difference of population means
\begin{equation}
\Delta=\mu_{2}-\mu_{1}.
(\#eq:DeltaB)
\end{equation}
In particular, if $\Delta=0$, the population means in
the two groups are the same. If $\Delta\ne 0$, they are not the same,
which implies that there is an association between $Y$ and the group in
the population.
Inference on $\Delta$ can be carried out using methods which are
straightforward modifications of the ones introduced first in Chapter
\@ref(c-probs). For significance testing, the null hypothesis of interest
is
\begin{equation}
H_{0}: \; \Delta=0,
(\#eq:mH0a)
\end{equation}
to be tested against a two-sided ($H_{a}:\; \Delta\ne 0$)
or one-sided ($H_{a}:\; \Delta> 0$ or $H_{a}:\; \Delta< 0$) alternative
hypothesis. The test statistic used to test (\@ref(eq:mH0a)) is again of the
form
\begin{equation}
t=\frac{\hat{\Delta}}{\hat{\sigma}_{\hat{\Delta}}}
(\#eq:tma)
\end{equation}
where $\hat{\Delta}$ is a sample estimate of $\Delta$, and
$\hat{\sigma}_{\hat{\Delta}}$ its estimated standard error. Here the
statistic is conventionally labelled $t$ rather than $z$ and called the
*t-test statistic* because sometimes the $t$-distribution rather than
the normal is used as its sampling distribution. This possibility is
discussed in Section \@ref(ss-means-inference-variants), and we can
ignore it until then.
Confidence intervals for the differences $\Delta$ are also of the
familiar form
\begin{equation}
\hat{\Delta} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{\Delta}}
(\#eq:ciDpa)
\end{equation}
where $z_{\alpha/2}$ is the appropriate multiplier from
the standard normal distribution to obtain the required confidence
level, e.g. $z_{0.025}=1.96$ for 95% confidence intervals. The
multiplier is replaced with a slightly different one if the
$t$-distribution is used as the sampling distribution, as discussed in
Section \@ref(ss-means-inference-variants).
The details of these formulas in the case of two-sample inference on
means are described next, in Section \@ref(ss-means-inference-test) for
the significance test and in Section \@ref(ss-means-inference-ci) for the
confidence interval.
### Significance testing: The two-sample t-test {#ss-means-inference-test}
For tests of the difference of means $\Delta=\mu_{2}-\mu_{1}$ between
two population distributions, we consider the null hypothesis of no
difference
\begin{equation}
H_{0}: \; \Delta=0.
(\#eq:H0m)
\end{equation}
In the housework example, this is the hypothesis that
average weekly hours of housework in the population are the same for men
and women. It is tested against an alternative hypothesis, either the
two-sided alternative hypotheses
\begin{equation}
H_{a}: \; \Delta\ne 0
(\#eq:Hatwom)
\end{equation}
or one of the one-sided alternative hypotheses
$$H_{a}: \Delta> 0 \text{ or } H_{a}: \Delta< 0.$$ In the discussion below, we concentrate on the more
common two-sided alternative.
The test statistic for testing (\@ref(eq:H0m)) is of the general form
(\@ref(eq:tma)). Here it depends on the data only through the sample means
$\bar{Y}_{1}$ and $\bar{Y}_{2}$ and sample variances $s_{1}^{2}$ and
$s_{2}^{2}$ of $Y$ in the two groups. A point estimate of $\Delta$ is
\begin{equation}
\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}.
(\#eq:Dhatmu)
\end{equation}
In terms of the population parameters, the standard
error of $\hat{\Delta}$ is
\begin{equation}
\sigma_{\hat{\Delta}}=\sqrt{\sigma^{2}_{\bar{Y}_{2}}+\sigma^{2}_{\bar{Y}_{1}}}=\sqrt{\frac{\sigma^{2}_{2}}{n_{2}}+\frac{\sigma^{2}_{1}}{n_{1}}}.
(\#eq:sigmaDmu)
\end{equation}
When we assume that the population standard
deviations $\sigma_{1}$ and $\sigma_{2}$ are equal, with a common value
$\sigma$, (\@ref(eq:sigmaDmu)) simplifies to
\begin{equation}
\sigma_{\hat{\Delta}} =\sigma\; \sqrt{\frac{1}{n_{2}}+\frac{1}{n_{1}}}.
(\#eq:seDpop)
\end{equation}
The formula of the test statistic uses an estimate of
this standard error, given by
\begin{equation}
\hat{\sigma}_{\hat{\Delta}} =\hat{\sigma} \; \sqrt{\frac{1}{n_{2}}+\frac{1}{n_{1}}}
(\#eq:seD2)
\end{equation}
where $\hat{\sigma}$ is an estimate of $\sigma$,
calculated from
\begin{equation}
\hat{\sigma}=\sqrt{\frac{(n_{2}-1)s^{2}_{2}+(n_{1}-1)s^{2}_{1}}{n_{1}+n_{2}-2}}.
(\#eq:sehatjoint)
\end{equation}
Substituting (\@ref(eq:Dhatmu)) and (\@ref(eq:seD2)) into
the general formula (\@ref(eq:tma)) gives the **two-sample t-test statistic
for means**
\begin{equation}
t=\frac{\bar{Y}_{2}-\bar{Y}_{1}}
{\hat{\sigma}\, \sqrt{1/n_{2}+1/n_{1}}}
(\#eq:ztestmuDb)
\end{equation}
where $\hat{\sigma}$ is given by (\@ref(eq:sehatjoint)).
For an illustration of the calculations, consider again the housework
Example 7.2. Here, denoting men by 1 and women by 2, $n_{1}=635$,
$n_{2}=469$, $\bar{Y}_{1}=7.33$, $\bar{Y}_{2}=8.49$, $s_{1}=5.53$ and
$s_{2}=6.14$. The estimated mean difference is thus
$$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}=8.49-7.33=1.16.$$ The common
value of the population standard deviation $\sigma$ is estimated from
(\@ref(eq:sehatjoint)) as $$\begin{aligned}
\hat{\sigma}&=&
\sqrt{\frac{(n_{2}-1)s^{2}_{2}+(n_{1}-1)s^{2}_{1}}{n_{1}+n_{2}-2}}
=
\sqrt{\frac{(469-1) 6.14^{2}+(635-1) 5.53^{2}}{635+469-2}}\\
&=& \sqrt{33.604}=5.797\end{aligned}$$ and the estimated standard error
of $\hat{\Delta}$ is given by (\@ref(eq:seD2)) as
$$\hat{\sigma}_{\hat{\Delta}} =
\hat{\sigma} \; \sqrt{\frac{1}{n_{2}}+\frac{1}{n_{1}}}
=5.797 \; \sqrt{\frac{1}{469}+\frac{1}{635}}=0.353.$$ The value of the
t-test statistic (\@ref(eq:ztestmuDb)) is then obtained as
$$t=\frac{1.16}{0.353}=3.29.$$ These values and other quantities
explained later, as well as similar results for Example 7.3, are also
shown in Table \@ref(tab:t-2testsY1).
|                                                          | $\hat{\Delta}$ | $\hat{\sigma}_{\hat{\Delta}}$ | $t$     | $P$-value | 95 % C.I.          |
|:---------------------------------------------------------|---------------:|------------------------------:|--------:|----------:|-------------------:|
| Example 7.2: Average weekly hours spent on housework     | 1.16           | 0.353                         | 3.29    | 0.001     | (0.47; 1.85)       |
| Example 7.3: Perceived friendliness of a police officer  | $-1.74$        | 0.383                         | $-4.55$ | $<0.001$  | $(-2.49; -0.99)$   |

:(\#tab:t-2testsY1)Results of tests and confidence intervals for comparing means for
two independent samples. For Example 7.2, the difference of means is
between women and men, and for Example 7.3, it is between wearing and
not wearing sunglasses. The test statistics and confidence intervals
are obtained under the assumption of equal population standard
deviations, and the $P$-values are for a test with a two-sided
alternative hypothesis. See the text for the definitions of the
statistics.
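If the raw data are available, this test is a single call to `t.test()` in R (with `var.equal = TRUE` for this equal-variance version). It can also be reproduced from the summary statistics alone; the sketch below mirrors formulas (\@ref(eq:Dhatmu))–(\@ref(eq:ztestmuDb)) for Example 7.2.

```r
## Two-sample t-test for Example 7.2 from summary statistics.
n1 <- 635; ybar1 <- 7.33; s1 <- 5.53   # men
n2 <- 469; ybar2 <- 8.49; s2 <- 6.14   # women

Delta.hat <- ybar2 - ybar1                                  # 1.16
sigma.hat <- sqrt(((n2 - 1) * s2^2 + (n1 - 1) * s1^2) /
                  (n1 + n2 - 2))                            # 5.797
se.Delta  <- sigma.hat * sqrt(1 / n2 + 1 / n1)              # 0.353
t.stat    <- Delta.hat / se.Delta                           # 3.29
p.value   <- 2 * pnorm(-abs(t.stat))                        # about 0.001
```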
\label{p-spss2a} If necessary, calculations like these can be carried
out even with a pocket calculator. It is, however, much more convenient
to leave them to statistical software. Figure \@ref(fig:f-spss2test) shows
SPSS output for the two-sample t-test for the housework data. The first
part of the table, labelled “Group Statistics”, shows the sample sizes
$n$, means $\bar{Y}$ and standard deviations $s$ separately for the two
groups. The quantity labelled “Std. Error Mean” is $s/\sqrt{n}$. This is
an estimate of the standard error of the sample mean, which is the
quantity $\sigma/\sqrt{n}$ discussed in Section \@ref(s-contd-clt).
The second part of the table in Figure \@ref(fig:f-spss2test), labelled
“Independent Samples Test”, gives results for the t-test itself. The
test considered here, which assumes a common population standard
deviation $\sigma$ (and thus also variance $\sigma^{2}$), is found on
the row labelled “Equal variances assumed”. The test statistic is shown
in the column labelled “$t$”, and the difference
$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}$ and its standard error
$\hat{\sigma}_{\hat{\Delta}}$ are shown in the “Mean Difference” and
“Std. Error Difference” columns respectively. Note that the difference
($-1.16$) has been calculated in SPSS between men and women rather than
vice versa as in Table \@ref(tab:t-2testsY1), but this will make no
difference to the conclusions from the test.
![(\#fig:f-spss2test)SPSS output for a two-sample $t$-test in Example 7.2, comparing average weekly hours spent on housework between men and women.](spss2t){width="17cm"}
In the two-sample situation with assumptions 1–4 at the beginning of Section \@ref(ss-means-inference-intro), the sampling distribution of the t-test
statistic (\@ref(eq:ztestmuDb)) is approximately a standard normal
distribution when the null hypothesis
$H_{0}: \; \Delta=\mu_{2}-\mu_{1}=0$ is true in the population and the
sample sizes are large enough. This is again a consequence of the
Central Limit Theorem. The requirement for “large enough” sample sizes
is fairly easy to satisfy. A good rule of thumb is that the sample sizes
$n_{1}$ and $n_{2}$ in the two groups should both be at least 20 for the
sampling distribution of the test statistic to be well enough
approximated by the standard normal distribution. In the housework
example we have data on 635 men and 469 women, so the sample sizes are
clearly large enough. A variant of the test which relaxes the condition
on the sample sizes is discussed in Section
\@ref(ss-means-inference-variants) below.
The $P$-value of the test is calculated from this sampling distribution
in exactly the same way as for the tests of proportions in Section
\@ref(ss-probs-test1sample-samplingd). In the housework example the value
of the $t$-test statistic is $t=3.29$. The $P$-value for testing the
null hypothesis against the two-sided alternative (\@ref(eq:Hatwom)) is then
the probability, calculated from the standard normal distribution, of
values that are at least 3.29 or at most $-3.29$. Each of these two
probabilities is about 0.0005, so the $P$-value is
$0.0005+0.0005=0.001$. In the SPSS output of Figure \@ref(fig:f-spss2test) it
is given in the column labelled “Sig. (2-tailed)”, where “Sig.” is short
for “significance” and “2-tailed” is a synonym for “2-sided”.
The $P$-value can also be calculated approximately using the table of
the standard normal distribution (see Table \@ref(tab:t-ttable), as explained in Section
\@ref(ss-probs-test1sample-samplingd)). Here the test statistic $t=3.29$,
which is larger than the critical values 1.65, 1.96 and 2.58 for the
0.10, 0.05 and 0.01 significance levels for a two-sided test, so we can
report that $P<0.01$. Here $t$ is by chance actually equal (to two
decimal places) to the critical value for the 0.001 significance level,
so we could also report $P=0.001$. These findings agree, as they should,
with the exact $P$-value of 0.001 shown in the SPSS output.
In conclusion, the two-sample $t$-test in Example 7.2 indicates that
there is very strong evidence (with $P=0.001$ for the two-sided test)
against the claim that the hours of weekly housework are on average the
same for men and women in the population.
Here we showed raw SPSS output in Figure \@ref(fig:f-spss2test) because we
wanted to explain its contents and format. Note, however, that such
unedited computer output is rarely if ever appropriate in research
reports. Instead, results of statistical analyses should be given in
text or tables formatted in appropriate ways for presentation. See Table
\@ref(tab:t-2testsY1) and various other examples in this coursepack and
textbooks on statistics.
To summarise the elements of the test again, we repeat them briefly, now
for Example 7.3, the experiment on the effect of eye contact on the
perceived friendliness of police officers (c.f. Table \@ref(tab:t-groupex) for the summary statistics):
1. Data: samples from two groups, one with the experimental condition
where the officer wore no sunglasses, with sample size $n_{1}=67$,
mean $\bar{Y}_{1}=8.23$ and standard deviation $s_{1}=2.39$, and the
second with the experimental condition where the officer did wear
sunglasses, with $n_{2}=66$, $\bar{Y}_{2}=6.49$ and $s_{2}=2.01$.
2. Assumptions: the observations are random samples of statistically
independent observations from two populations, one with mean
$\mu_{1}$ and standard deviation $\sigma_{1}$, and the other with
mean $\mu_{2}$ and standard deviation $\sigma_{2}$,
where the standard deviations are equal, with value
$\sigma=\sigma_{1}=\sigma_{2}$. The sample sizes $n_{1}$ and $n_{2}$
are sufficiently large, say both at least 20, for the sampling
distribution of the test statistic under the null hypothesis to be
approximately standard normal.
3. Hypotheses: These are about the difference of the population means
$\Delta=\mu_{2}-\mu_{1}$, with null hypothesis $H_{0}: \Delta=0$.
The two-sided alternative hypothesis $H_{a}: \Delta\ne 0$ is
considered in this example.
4. The test statistic: the two-sample $t$-statistic
$$t=\frac{\hat{\Delta}}{\hat{\sigma}_{\hat{\Delta}}}=
\frac{-1.74}{0.383}=-4.55$$ where
$$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}=6.49-8.23=-1.74$$ and
$$\hat{\sigma}_{\hat{\Delta}}=
\hat{\sigma} \; \sqrt{\frac{1}{n_{2}}+\frac{1}{n_{1}}}
=2.210 \times \sqrt{
\frac{1}{66}+\frac{1}{67}}=0.383$$ with $$\hat{\sigma}=
\sqrt{\frac{(n_{2}-1)s^{2}_{2}+(n_{1}-1)s^{2}_{1}}{n_{1}+n_{2}-2}}
=
\sqrt{\frac{65\times 2.01^{2}+66\times 2.39^{2}}{131}}
=2.210$$
5. The sampling distribution of the test statistic when $H_{0}$ is
true: approximately the standard normal distribution.
6. The $P$-value: the probability that a randomly selected value from
the standard normal distribution is at most $-4.55$ or at least
4.55, which is about 0.000005 (reported as $P<0.001$).
7. Conclusion: A two-sample $t$-test indicates very strong evidence
that the average perceived level of the friendliness of a police
officer is different when the officer is wearing reflective
sunglasses than when the officer is not wearing such glasses
($P<0.001$).
### Confidence intervals for a difference of two means {#ss-means-inference-ci}
A confidence interval for the mean difference $\Delta=\mu_{2}-\mu_{1}$
is obtained by substituting appropriate expressions into the general
formula (\@ref(eq:ciDpa)). Specifically, here
$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}$ and a 95% confidence interval for
$\Delta$ is
\begin{equation}
(\bar{Y}_{2}-\bar{Y}_{1}) \pm 1.96\; \hat{\sigma} \;\sqrt{\frac{1}{n_{2}}+\frac{1}{n_{1}}}
(\#eq:ciDmu2)
\end{equation}
where $\hat{\sigma}$ is obtained from equation
\@ref(eq:sehatjoint). The validity of this again requires that the sample
sizes $n_{1}$ and $n_{2}$ from both groups are reasonably large, say
both at least 20. For the housework Example 7.2, the 95% confidence
interval is $$1.16\pm 1.96\times 0.353 = 1.16 \pm 0.69 = (0.47; 1.85)$$
using the values of $\bar{Y}_{2}-\bar{Y}_{1}$ and its standard error
calculated earlier. This interval is also shown in Table
\@ref(tab:t-2testsY1) and in the SPSS output in
Figure \@ref(fig:f-spss2test) \label{p-spss2c}.
In the latter, the interval is given as (-1.85; -0.47) because it is
expressed for the difference defined in the opposite direction (men $-$
women instead of vice versa). For Example 7.3, the 95% confidence
interval is $-1.74\pm 1.96\times 0.383=(-2.49;
-0.99)$.
Based on the data in Example 7.2 we are thus 95 % confident that the
difference between women’s and men’s average hours of reported weekly
housework in the population is between 0.47 and 1.85 hours. In
substantive terms this interval, from just under half an hour to nearly
two hours, is arguably fairly wide in that its two end points might well
be regarded as substantially different from each other. The difference
between women’s and men’s average housework hours is thus estimated
fairly imprecisely from this survey.
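Continuing the R sketch of the previous section, the interval (\@ref(eq:ciDmu2)) is a one-line calculation:

```r
## 95% confidence interval for Delta in Example 7.2,
## reusing Delta.hat = 1.16 and se.Delta = 0.353 from above:
Delta.hat + c(-1, 1) * qnorm(0.975) * se.Delta   # (0.47, 1.85)
```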
### Variants of the test and confidence interval {#ss-means-inference-variants}
#### Allowing unequal population variances {-}
The two-sample $t$-test and confidence interval for the difference of
means were stated above under the assumption that the standard
deviations $\sigma_{1}$ and $\sigma_{2}$ of the variable of interest $Y$
are the same in both of the two groups being compared. This assumption
is not in fact essential. If it is omitted, we obtain formulas which
differ from the ones discussed above only in one part of the
calculations.
Suppose that we do allow the unknown values of $\sigma_{1}$ and
$\sigma_{2}$ to be different from each other. In other words, we
consider the model stated at the beginning of Section \@ref(ss-means-inference-intro), without
assumption 4 that $\sigma_{1}=\sigma_{2}$. The test statistic is then
still of the same form as before,
i.e. $t=\hat{\Delta}/\hat{\sigma}_{\hat{\Delta}}$, with
$\hat{\Delta}=\bar{Y}_{2}-\bar{Y}_{1}$. The only change in the
calculations is that the estimate of the standard error of
$\hat{\Delta}$, the formula of which is given by equation
(\@ref(eq:sigmaDmu)), now uses separate estimates
of $\sigma_{1}$ and $\sigma_{2}$. The obvious choices for these are the
corresponding sample standard deviations, $s_{1}$ for $\sigma_{1}$ and
$s_{2}$ for $\sigma_{2}$. This gives the estimated standard error as
\begin{equation}
\hat{\sigma}_{\hat{\Delta}}=\sqrt{\frac{s_{2}^{2}}{n_{2}}+\frac{s_{1}^{2}}{n_{1}}}.
(\#eq:seDmu-ne)
\end{equation}
Substituting this into the formula of the test
statistic yields the two-sample $t$-test statistic without the
assumption of equal population standard deviations,
\begin{equation}
t=\frac{\bar{Y}_{2}-\bar{Y}_{1}}{\sqrt{s^{2}_{2}/n_{2}+s^{2}_{1}/n_{1}}}.
(\#eq:ztestmuD)
\end{equation}
The sampling distribution of this under the null
hypothesis is again approximately a standard normal distribution when
the sample sizes $n_{1}$ and $n_{2}$ are both at least 20. The $P$-value
for the test is obtained in exactly the same way as before, and the
principles of interpreting the result of the test are also unchanged.
For the confidence interval, the only change from Section
\@ref(ss-means-inference-ci) is again that the estimated standard error
is changed, so for a 95% confidence interval we use
\begin{equation}
(\bar{Y}_{2}-\bar{Y}_{1}) \pm 1.96 \;\sqrt{\frac{s^{2}_{2}}{n_{2}}+\frac{s^{2}_{1}}{n_{1}}}.
(\#eq:ciDmu)
\end{equation}
In the housework example 7.2, the estimated standard error
(\@ref(eq:seDmu-ne)) is $$\hat{\sigma}_{\hat{\Delta}}=
\sqrt{
\frac{6.14^{2}}{469}+
\frac{5.53^{2}}{635}
}=
\sqrt{0.1285}=0.359,$$ the value of the test statistic is
$$t=\frac{1.16}{0.359}=3.23,$$ and the two-sided $P$-value is now
$P=0.001$. Recall that when the population standard deviations were
assumed to be equal, we obtained $\hat{\sigma}_{\hat{\Delta}}=0.353$,
$t=3.29$ and again $P=0.001$. The two sets of results are thus very
similar, and the conclusions from the test are the same in both cases.
The differences between the two variants of the test are even smaller in
Example 7.3, where the estimated standard error
$\hat{\sigma}_{\hat{\Delta}}=0.383$ is the same (to three decimal
places) in both cases, and the results are thus identical.^[In this case this is a consequence of the fact that the sample
sizes (67 and 66) in the two groups are very similar. When they are
exactly equal, formulas (\@ref(eq:sehatjoint))–(\@ref(eq:ztestmuDb)) and
(\@ref(eq:seDmu-ne)) actually give exactly the same value for the
standard error $\hat{\sigma}_{\hat{\Delta}}$, and $t$ is thus also
the same for both variants of the test.] In both
examples the confidence intervals obtained from (\@ref(eq:ciDmu2)) and
(\@ref(eq:ciDmu)) are also very similar. Both variants of the two-sample
analyses are shown in the SPSS output (c.f. Figure \@ref(fig:f-spss2test)), the one assuming equal population standard
deviations on the row labelled “Equal variances assumed” and the one
without this assumption on the “Equal variances not assumed” row.^[The output also shows, under “Levene’s test”, a test statistic and
$P$-value for testing the hypothesis of equal standard deviations
($H_{0}: \,
\sigma_{1}=\sigma_{2}$). However, we prefer not to rely on this
because the test requires the additional assumption that the
population distributions are normal, and is very sensitive to the
correctness of this assumption.]
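In the R sketch used for Example 7.2, only the standard error changes under this variant; note that with raw data `t.test()` computes this unequal-variance (Welch) version by default, and `var.equal = TRUE` must be requested explicitly for the pooled version.

```r
## Welch variant for Example 7.2, reusing the summary statistics above:
se.Delta2 <- sqrt(s2^2 / n2 + s1^2 / n1)     # 0.359
t.stat2   <- Delta.hat / se.Delta2           # 3.23
2 * pnorm(-abs(t.stat2))                     # about 0.001
```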
Which methods should we then use, the ones with or without the
assumption of equal population variances? In practice the choice rarely
makes much difference, and the $P$-values and conclusions from the two
versions of the test are typically very similar.^[In the MY451 examination and homework, for example, both variants
of the test are equally acceptable, unless a question explicitly
states otherwise.] Not assuming the
variances to be equal has the advantage of making fewer restrictive
assumptions about the population. For this reason it should be used in
the rare cases where the $P$-values obtained under the different
assumptions are substantially different. This version of the test
statistic is also slightly easier to calculate by hand, since
(\@ref(eq:seDmu-ne)) is a slightly simpler formula than
(\@ref(eq:seD2))–(\@ref(eq:sehatjoint)). On the other hand, the test statistic
which does assume equal standard deviations has the advantage that it is
more closely related to analogous tests used in more general contexts
(especially the method of linear regression modelling, discussed in
Chapter \@ref(c-regression)). It is also preferable when the sample sizes
are very small, as discussed below.
#### Using the $t$ distribution {-}
As discussed in Section \@ref(s-contd-probdistrs), it is often assumed
that the population distributions of the variables under consideration
are described by particular probability distributions. In this chapter,
however, such assumptions have so far been avoided. This is a
consequence of the Central Limit Theorem, which ensures that as long as
the sample sizes are large enough, the sampling distribution of the
two-sample $t$-test statistic is approximately the standard normal
distribution, irrespective of the forms of the population distributions
of $Y$ in the two groups. In this section we briefly describe variants
of the test and confidence interval which *do* assume that the
population distributions are of a particular form, specifically that
they are normal distributions. This changes the sampling distribution
that is used for the test statistic and for the multiplier of the
confidence interval, but the analyses are otherwise unchanged.
For the significance test, there are again two variants depending on the
assumptions about the population standard deviations $\sigma_{1}$
and $\sigma_{2}$. Consider first the case where these are assumed to be
equal. The sampling distribution is then given by the following result,
which now holds for *any* sample sizes $n_{1}$ and $n_{2}$:
- In the two-sample situation specified by assumptions 1–4 at the beginning of Section \@ref(ss-means-inference-intro) (including the assumption of equal population
standard deviations, $\sigma_{1}=\sigma_{2}=\sigma$), and if also
the distribution of $Y$ is a normal distribution in both groups, the
sampling distribution of the t-test statistic (\@ref(eq:ztestmuDb)) is a
$t$ distribution with $n_{1}+n_{2}-2$ degrees of freedom when the
null hypothesis $H_{0}: \;
\Delta=\mu_{2}-\mu_{1}=0$ is true in the population.
The $\mathbf{t}$ **distributions** mentioned in this result are a family
of distributions with different degrees of freedom, in a similar way as
the $\chi^{2}$ distributions discussed in Section
\@ref(ss-tables-chi2test-sdist). All $t$ distributions are symmetric
around 0. Their shape is quite similar to that of the standard normal
distribution, except that the variance of a $t$ distribution is somewhat
larger and its tails thus heavier. The difference is noticeable only
when the degrees of freedom are small, as seen in Figure
\@ref(fig:f-tdistr1). This shows the curves for the $t$ distributions with 6
and 30 degrees of freedom, compared to the standard normal distribution.
It can be seen that the $t_{30}$ distribution is already very similar to
the $N(0,1)$ distribution. With degrees of freedom larger than about 30,
the difference becomes almost indistinguishable.
![(\#fig:f-tdistr1)Curves of two $t$ distributions with small degrees of freedom, compared to the standard normal distribution.](tdistr1){width="13cm"}
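Density curves like these can be drawn directly from the `dt()` and `dnorm()` functions in R; a minimal sketch of the comparison in Figure \@ref(fig:f-tdistr1):

```r
## t densities with 6 and 30 df against the standard normal density.
x <- seq(-4, 4, length.out = 201)
plot(x, dnorm(x), type = "l", ylab = "Density")
lines(x, dt(x, df = 30), lty = 2)
lines(x, dt(x, df = 6),  lty = 3)
legend("topright", legend = c("N(0,1)", "t, df = 30", "t, df = 6"), lty = 1:3)
```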
If we use this result for the test, the $P$-value is obtained from the
$t$ distribution with $n_{1}+n_{2}-2$ degrees of freedom (often denoted
$t_{n_{1}+n_{2}-2}$). The principles of doing this are exactly the same as
those described in Section \@ref(ss-probs-test1sample-samplingd), and can
be graphically illustrated by plots similar to those in Figure
\@ref(fig:f-pval-prob). Precise $P$-values are
again obtained using a computer. In fact, $P$-values in SPSS output for
the two-sample $t$-test (c.f. Figure \@ref(fig:f-spss2test)) are actually those obtained from the $t$
distribution (with the degrees of freedom shown in the column labelled
“df”) rather than the standard normal distribution. Differences between
the two are, however, very small if the sample sizes are even moderately
large, because then the degrees of freedom $df=n_{1}+n_{2}-2$ are large
enough for the two distributions to be virtually identical. This is the
case, for instance, in both of the examples considered so far in this
chapter, where $df=1102$ in Example 7.2 and $df=131$ in Example 7.3.
If precise $P$-values from the $t$ distribution are not available, upper
bounds for them can again be obtained using appropriate tables, in the
same way as in Section \@ref(ss-probs-test1sample-samplingd). Now,
however, the critical values depend also on the degrees of freedom.
Because of this, introductory textbooks on statistics typically include
a table of critical values for $t$ distributions for a selection of
degrees of freedom. A table of this kind is shown in the Appendix at the end of this coursepack. Each row of the
table corresponds to a $t$ distribution with the degrees of freedom
given in the column labelled “df”. As here, such tables typically
include all degrees of freedom between 1 and 30, plus a selection of
larger values, here 40, 60 and 120.
The last row is labelled “$\infty$”, the mathematical symbol for
infinity. This corresponds to the standard normal distribution, as a $t$
distribution with infinite degrees of freedom is equal to the standard
normal. The practical implication of this is that the standard normal
distribution is a good enough approximation for any $t$ distribution
with reasonably large degrees of freedom. The table thus lists
individual degrees of freedom only up to some point, and the last row
will be used for any values larger than this. For degrees of freedom
between two values shown in the table (e.g. 50 when only 40 and 60 are
given), it is best to use the values for the nearest available degrees
of freedom *below* the required ones (e.g. use 40 for 50). This will
give a “conservative” approximate $P$-value which may be slightly larger
than the exact value.
As for the standard normal distribution, the table is used to identify
critical values for different significance levels (c.f. the information
in Table \@ref(tab:t-ttable)). For example, if the degrees of freedom are 20,
the critical value for two-sided tests at the significance level 0.05 is found in
the “0.025” column on the row labelled “20”. This is 2.086. In general,
critical values for $t$ distributions are somewhat larger than
corresponding values for the standard normal distribution, but the
difference between the two is quite small when the degrees of freedom
are reasonably large.
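When a computer is available, critical values and exact $P$-values from any $t$ distribution can be obtained directly, for example with `qt()` and `pt()` in R:

```r
qt(0.975, df = 20)         # 2.086: two-sided 5% critical value, 20 df
qt(0.975, df = 1102)       # 1.962: practically the normal value 1.96
2 * pt(-3.29, df = 1102)   # two-sided P-value for t = 3.29 in Example 7.2
```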
The $t$-test and the $t$ distribution are among the oldest tools of
statistical inference. They were introduced in 1908 by W. S. Gosset,^[Student (1908). “The probable error of a mean”. *Biometrika*
**6**, 1–25.]
initially for the one-sample case discussed in Section
\@ref(s-means-1sample). Gosset was working as a chemist at the Guinness
brewery at St. James’ Gate, Dublin. He published his findings under the
pseudonym “Student”, and the distribution is often known as *Student’s
$t$ distribution*.
These results for the sampling distribution hold when the population
standard deviations $\sigma_{1}$ and $\sigma_{2}$ are assumed to be
equal. If this assumption is not made, the test statistic is again
calculated using formulas (\@ref(eq:seDmu-ne)) and (\@ref(eq:ztestmuD)). This case is mathematically more difficult than the
previous one, because the sampling distribution of the test statistic
under the null hypothesis is then not exactly a $t$ distribution even
when the population distributions are normal. One way of dealing with
this complication (which is known as the Behrens–Fisher problem) is to
find a $t$ distribution which is a good approximation of the true
sampling distribution. The degrees of freedom of this approximating
distribution are given by
\begin{equation}
df=\frac{\left(\frac{s^{2}_{1}}{n_{1}}+\frac{s^{2}_{2}}{n_{2}}\right)^{2}}{\left(\frac{s_{1}^{2}}{n_{1}}\right)^{2}\;\left(\frac{1}{n_{1}-1}\right)+\left(\frac{s_{2}^{2}}{n_{2}}\right)^{2}\;\left(\frac{1}{n_{2}-1}\right)}.
(\#eq:satter-df)