-
Notifications
You must be signed in to change notification settings - Fork 7
/
09-MY451-3waytables.Rmd
683 lines (538 loc) · 36.1 KB
/
09-MY451-3waytables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
# Analysis of 3-way contingency tables {#c-3waytables}
In Section \@ref(s-descr1-2cat) and Chapter \@ref(c-tables) we discussed
the analysis of two-way contingency tables (crosstabulations) for
examining the associations between two categorical variables. In this
section we extend this by introducing the basic ideas of **multiway
contingency tables** which include more than two categorical variables.
We focus solely on the simplest instance of them, a **three-way table**
of three variables.
This topic is thematically related also to some of Chapter
\@ref(c-regression), in that a multiway contingency table can be seen as
a way of implementing for categorical variables the ideas of statistical
control that were also a feature of the multiple linear regression model
of Section \@ref(s-regression-multiple). Here, however, we will not
consider formal regression models for categorical variables (these are
mentioned only briefly at the end of the chapter). Instead, we give
examples of analyses which simply apply familiar methods for two-way
tables repeatedly for tables of two variables at fixed values of a third
variable.
The discussion is organised arond three examples. In each case we start
with a two-way table, and then introduce a third variable which we want
to control for. This reveals various features in the examples, to
illustrate the types of findings that may be uncovered by statistical
control.
**Example 9.1: Berkeley admissions**
Table \@ref(tab:t-berkeley1) summarises data on applications for admission to
graduate study at the University of California, Berkeley, for the fall
quarter 1973.^[These data, which were produced by the Graduate Division of UC
Berkeley, were first discussed in Bickel, P. J., Hammel, E. A., and
O’Connell, J. W. (1975), “Sex bias in graduate admissions: Data from
Berkeley”, *Science* **187**, 398–404. They have since become a
much-used teaching example. The version of the data considered here
are from Freedman, D., Pisani, R., and Purves, R., *Statistics*
(W. W. Norton, 1978).] The data are for five of the six departments with the
largest number of applications, labelled below Departments 2–5
(Department 1 will be discussed at the end of this section). Table
\@ref(tab:t-berkeley1) shows the two-way contingency table of the sex of the
applicant and whether he or she was admitted to the university.
-----------------------------------------------------
\ Admitted Admitted \ \
Sex No Yes % Yes Total
------------- ------------ ---------- ------- -------
Male 1180 686 36.8 1866
Female 1259 468 27.1 1727
Total 2439 1154 32.1 3593
-----------------------------------------------------
: (\#tab:t-berkeley1)Table of sex of applicant vs. admission in the Berkeley admissions
data. The column labelled ‘% Yes’ is the percentage of applicants
admitted within each row. $\chi^{2}=38.4$, $df=1$, $P<0.001$.
The percentages in Table \@ref(tab:t-berkeley1) show that men were more
likely to be admitted, with a 36.8% success rate compared to 27.1% for
women. The difference is strongly significant, with $P<0.001$ for the
$\chi^{2}$ test of independence. If this association was interpreted
causally, it might be regarded as evidence of sex bias in the admissions
process. However, other important variables may also need to be
considered in the analysis. One of them is the academic department to
which an applicant had applied. Information on the department as well as
sex and admission is shown in Table \@ref(tab:t-berkeley2).
-----------------------------------------------------------------------------
\ Admitted Admitted \ \
Department Sex No Yes % Yes Total
---------------------------- ---------- ---------- ---------- ------- -------
2 Male 207 353 63.0 560
Female 8 17 68.0 25
Total 215 370 63.2 585
$\chi^{2}=0.25$, $P=0.61$
3 Male 205 120 36.9 325
Female 391 202 34.1 593
Total 596 322 35.1 918
$\chi^{2}=0.75$, $P=0.39$
4 Male 279 138 33.1 417
Female 244 131 34.9 375
Total 523 269 34.0 792
$\chi^{2}=0.30$, $P=0.59$
5 Male 138 53 27.7 191
Female 299 94 23.9 393
Total 437 147 25.2 584
$\chi^{2}=1.00$, $P=0.32$
6 Male 351 22 5.9 373
Female 317 24 7.0 341
Total 668 46 6.4 714
$\chi^{2}=0.38$, $P=0.54$
Total 2439 1154 32.1 3593
-----------------------------------------------------------------------------
: (\#tab:t-berkeley2)Sex of applicant vs. admission by academic department in the
Berkeley admissions data.
Table \@ref(tab:t-berkeley2) is a *three-way* contingency table, because each
of its internal cells shows the number of applicants with a particular
combination of three variables: department, sex and admission status.
For example, the frequency 207 in the top left corner indicates that
there were 207 male applicants to department 2 who were not admitted.
Table \@ref(tab:t-berkeley2) is presented in the form of a series of tables
of sex vs. admission, just like in the original two-way table
\@ref(tab:t-berkeley1), but now with one table for each department. These are
known as **partial tables** of sex vs. admission, **controlling for**
department. The word “control” is used here in the same sense as before:
each partial table summarises the data for the applicants to a single
department, so the variable “department” is literally held constant
within the partial tables.
Table \@ref(tab:t-berkeley2) also contains the marginal distributions of sex
and admission status within each department. They can be used to
construct the other two possible two-way tables for these variables, for
department vs. sex of applicant and department vs. admission status.
This information, summarised in Table \@ref(tab:t-berkeley3), is discussed
further below.
The association between sex and admission within each partial table can
be examined using methods for two-way tables. For every one of them, the
$\chi^{2}$ test shows that the hypothesis of independence cannot be
rejected, so there is no evidence of sex bias within any department. The
apparent association in Table \@ref(tab:t-berkeley1) is thus spurious, and
disappears when we control for department. Why this happens can be
understood by considering the distributions of sex and admissions across
departments, as shown in Table \@ref(tab:t-berkeley3). Department is clearly
associated with sex of the applicant: for example, almost all of the
applicants to department 2, but only a third of the applicants to
department 5 are men. Similarly, there is an association between
department and admission: for example, nearly two thirds of the
applicants to department 2 but only a quarter of the applicants to
department 5 were admitted. It is the combination of these associations
which induces the spurious association between sex and admission in
Table \@ref(tab:t-berkeley1). In essence, women had a lower admission rate
overall because relatively more of them applied to the more selective
departments and fewer to the easy ones.
Of all applicants Department 2 3 4 5 6
---------------------- -------------- ----- ----- ----- -----
% Male 96 35 53 33 52
% Admitted 63 35 34 25 6
Number of applicants 585 918 792 584 714
: (\#tab:t-berkeley3)Percentages of male applicants and applicants admitted by department
in the Berkeley admissions data.
One possible set of causal connections leading to a spurious association
between $X$ and $Y$ was represented graphically by Figure \@ref(fig:f-xyzspurious). There are, however, other possibilities which may
be more appropriate in particular cases. In the admissions example,
department (corresponding to the control variable $Z$) cannot be
regarded as the cause of the sex of the applicant. Instead, we may
consider the causal chain Sex $\longrightarrow$ Department
$\longrightarrow$ Admission. Here department is an *intervening
variable* between sex and admission rather than a common cause of them.
We can still argue that sex has an effect on admission, but it is an
*indirect effect* operating through the effect of sex on choice of
department. The distinction is important for the original research
question behind these data, that of possible sex bias in admissions. A
direct effect of sex on likelihood on admission might be evidence of
such bias, because it might indicate that departments are treating male
and female candidates differently. An indirect effect of the kind found
here does not suggest bias, because it results from the applicants’ own
choices of which department to apply to.
In the admissions example a strong association in the initial two-way
table was “explained away” when we controlled for a third variable. The
next example is one where controlling leaves the initial association
unchanged.
**Example 9.2:** **Importance of short-term gains for investors
(continued)**
Table \@ref(tab:t-investors) showed a
relatively strong association between a person’s age group and his or
her attitude towards short-term gains as an investment goal. This
association is also strongly significant, with $P<0.001$ for the
$\chi^{2}$ test of independence. Table \@ref(tab:t-investors3) shows the
crosstabulations of these variables, now controlling also for the
respondent’s sex. The association is now still significant in both
partial tables. An investigation of the row proportions suggests that
the pattern of association is very similar in both tables, as is its
strength as measured by the $\gamma$ statistic ($\gamma=-0.376$ among
men and $\gamma=-0.395$ among women). The conclusions obtained from the
original two-way table are thus unchanged after controlling for sex.
----------------------------------------------------------------------------
**MEN** \ Slightly \ Very \
Age group Irrelevant important Important important Total
---------------- --------------- ----------- ----------- ----------- -------
Under 45 29 35 30 22 116
0.250 0.302 0.259 0.190 1.000
45–54 83 60 52 29 224
0.371 0.268 0.232 0.129 1.000
55–64 116 40 28 16 200
0.580 0.200 0.140 0.080 1.000
65 and over 150 53 16 12 231
0.649 0.229 0.069 0.052 1.000
Total 378 188 126 79 771
0.490 0.244 0.163 0.102 1.000
----------------------------------------------------------------------------
: (\#tab:t-investors3)Frequencies of respondents by age group and attitude towards
short-term gains in Example 9.2, controlling for sex of respondent.
The numbers below the frequencies are proportions within rows. $\chi^{2}=82.4$, $df=9$, $P<0.001$. $\gamma=-0.376$.
----------------------------------------------------------------------------
**WOMEN** \ Slightly \ Very \
Age group Irrelevant important Important important Total
---------------- --------------- ----------- ----------- ----------- -------
Under 45 8 10 8 4 30
0.267 0.333 0.267 0.133 1.000
45–54 28 17 5 8 58
0.483 0.293 0.086 0.138 1.000
55–64 37 9 3 4 53
0.698 0.170 0.057 0.075 1.000
65 and over 43 11 3 3 60
0.717 0.183 0.050 0.050 1.000
Total 116 47 19 19 201
0.577 0.234 0.095 0.095 1.000
----------------------------------------------------------------------------
: (\#tab:t-investors3)Frequencies of respondents by age group and attitude towards
short-term gains in Example 9.2, controlling for sex of respondent.
The numbers below the frequencies are proportions within rows. $\chi^{2}=27.6$, $df=9$, $P=0.001$. $\gamma=-0.395$.
**Example 9.3**: ***The Titanic***
The passenger liner RMS *Titanic* hit an iceberg and sank in the North
Atlantic on 14 April 1912, with heavy loss of life. Table
\@ref(tab:t-titanic2) shows a crosstabulation of the people on board the
*Titanic*, classified according to their status (as male passenger,
female or child passenger, or member of the ship’s crew) and whether
they survived the sinking.^[The data are from the 1912 report of the official British Wreck
Commissioner’s inquiry into the sinking, available at
<http://www.titanicinquiry.org>.] The $\chi^{2}$ test of independence has
$P<0.001$ for this table, so there are statistically significant
differences in probabilities of survival between the groups. The table
suggests, in particular, that women and children among the passengers
were more likely to survive than male passengers or the ship’s crew.
-----------------------------------------------------------
Survivor: \ \
Group Yes No Total
---------------------------- ---------- --------- ---------
Male passenger 146 659 805
(0.181) (0.819) (1.000)
Female or child passenger 353 158 511
(0.691) (0.309) (1.000)
Crew member 212 673 885
(0.240) (0.760) (1.000)
Total 711 1490 2201
(0.323) (0.677) (1.000)
-----------------------------------------------------------
: (\#tab:t-titanic2)Survival status of the people aboard the *Titanic*, divided into
three groups. The numbers in brackets are proportions of survivors and
non-survivors within each group. $\chi^{2}=418$, $\text{df}=2$, $P<0.001$.
We next control also for the class in which a person was travelling,
classified as first, second or third class. Since class does not apply
to the ship’s crew, this analysis is limited to the passengers,
classified as men vs. women and children. The two-way table of sex by
survival status for them is given by Table \@ref(tab:t-titanic2), ignoring
the row for crew members. This association is strongly significant, with
$\chi^{2}=344$ and $P<0.001$.
------------------------------------------------------
\ \ Survivor: \ \
Class Group Yes No Total
-------- ---------------- ------------ ------- -------
First Man 57 118 175
0.326 0.674 1.000
Woman or child 146 4 150
0.973 0.027 1.000
Total 203 122 325
0.625 0.375 1.000
Second Man 14 154 168
0.083 0.917 1.000
Woman or child 104 13 117
0.889 0.111 1.000
Total 118 167 285
0.414 0.586 1.000
Third Man 75 387 462
0.162 0.838 1.000
Woman or child 103 141 244
0.422 0.578 1.000
Total 178 528 706
0.252 0.748 1.000
Total 499 817 1316
0.379 0.621 1.000
------------------------------------------------------
:(\#tab:t-titanic3)Survival status of the passengers of the *Titanic*, classified by
class and sex. The numbers below the frequencies are proportions
within rows.
Two-way tables involving class (not shown here) suggest that it is
mildly associated with sex (with percentages of men 54%, 59% and 65% in
first, second and third class respectively) and strongly associated with
survival (with 63%, 41% and 25% of the passengers surviving). It is thus
possible that class might influence the association between sex and
survival. This is investigated in Table \@ref(tab:t-titanic3), which shows
the partial associations between sex and survival status, controlling
for class. This association is strongly significant (with $P<0.001$ for
the $\chi^{2}$ test) in every partial table, so it is clearly not
explained away by associations involving class. The direction of the
association is also the same in each table, with women and children more
likely to survive than men among passengers of every class.
The presence and direction of the association in the two-way Table
\@ref(tab:t-titanic2) are thus preserved and similar in every partial table
controlling for class. However, there appear to be differences in the
*strength* of the association between the partial tables. Considering,
for example, the ratios of the proportions in each class, women and
children were about 3.0 ($=0.973/0.326$) times more likely to survive
than men in first class, while the ratio was about 10.7 in second class
and 2.6 in the third. The contrast of men vs. women and children was
thus strongest among second-class passengers. This example differs in
this respect from the previous ones, where the associations were similar
in each partial table, either because they were all essentially zero
(Example 9.1) or non-zero but similar in both direction and strength
(Example 9.2).
We are now considering three variables, class, sex and survival.
Although it is not necessary for this analysis to divide them into
explanatory and response variables, introducing such a distinction is
useful for discussion of the results. Here it is most natural to treat
survival as the response variable, and both class and sex as explanatory
variables for survival. The associations in the partial tables in Table
\@ref(tab:t-titanic3) are then partial associations between the response
variable and one of the explanatory variables (sex), controlling for the
other explanatory variable (class). As discussed above, the strength of
this partial association is different for different values of class.
This is an example of a statistical **interaction**. In general, there
is an interaction between two explanatory variables if the strength and
nature of the partial association of (either) one of them on a response
variable depends on the value at which the other explanatory variable is
controlled. Here there is an interaction between class and sex, because
the association between sex and survival is different at different
levels of class.
Interactions are an important but challenging element of many
statistical analyses. Important, because they often correspond to
interesting and subtle features of associations in the data.
Challenging, because understanding and interpreting them involves
talking about (at least) three variables at once. This can seem rather
complicated, at least initially. It adds to the difficulty that
interactions can take many forms. In the *Titanic* example, for
instance, the nature of the class-sex interaction was that the
association between sex and survival was in the same direction but of
different strengths at different levels of class. In other cases
associations may disappear in some but not all of the partial tables, or
remain strong but in different directions in different ones. They may
even all or nearly all be in a different direction from the association
in the original two-way table, as in the next example.
**Example 9.4: Smoking and mortality**
A health survey was carried out in Whickham near Newcastle upon Tyne in
1972–74, and a follow-up survey of the same respondents twenty years
later.^[The two studies are reported in Tunbridge, W. M. G. et al. (1977).
“The spectrum of thyroid disease in a community: The Whickham
survey”. *Clinical Endocrinology* **7**, 481–493, and Vanderpump,
M. P. J. et al. (1995). “The incidence of thyroid disorders in the
community: A twenty-year follow-up of the Whickham survey”.
*Clinical Endocrinology* **43**, 55–69. The data are used to
illustrate Simpson’s paradox by Appleton, D. R. et al. (1996).
“Ignoring a covariate: An example of Simpson’s paradox”. *The
American Statistician* **50**, 340–341.] Here we consider only the $n=1314$ female respondents who
were classified by the first survey either as current smokers or as
never having smoked. Table \@ref(tab:t-whickham1) shows the
crossclassification of these women according to their smoking status in
1972–74 and whether they were still alive twenty years later. The
$\chi^{2}$ test indicates a strongly significant association (with
$P=0.003$), and the numbers suggest that a smaller proportion of smokers
than of nonsmokers had died between the surveys. Should we thus conclude
that smoking helps to keep you alive? Probably not, given that it is
known beyond reasonable doubt that the causal relationship between
smoking and mortality is in the opposite direction. Clearly the picture
has been distorted by failure to control for some relevant further
variables. One such variable is the age of the respondents.
Smoker Dead Alive Total
-------- ------- ------- -------
Yes 139 443 582
0.239 0.761 1.000
No 230 502 732
0.314 0.686 1.000
Total 369 945 1314
0.281 0.719 1.000
: (\#tab:t-whickham1)Table of smoking status in 1972–74 vs. twenty-year survival among
the respondents in Example 9.4. The numbers below the frequencies are
proportions within rows.
Table \@ref(tab:t-whickham2) shows the partial tables of smoking vs. survival
controlling for age at the time of the first survey, classified into
seven categories. Note first that this three-way table appears somewhat
different from those shown in Tables \@ref(tab:t-berkeley2),
\@ref(tab:t-investors3) and \@ref(tab:t-titanic3). This is because one variable,
survival status, is summarised only by the percentage of survivors
within each combination of age group and smoking status. This is a
common trick to save space in three-way tables involving dichotomous
variables like survival here. The full table can easily be constructed
from these numbers if needed. For example, 98.4% of the nonsmokers aged
18–24 were alive at the time of the second survey. Since there were a
total of 62 respondents in this group (as shown in the last column),
this means that 61 of them (i.e. 98.4%) were alive and 1 (or 1.6%) was
not.
The percentages in Table \@ref(tab:t-whickham2) show that in five of the
seven age groups the proportion of survivors is higher among nonsmokers
than smokers, i.e. these partial associations in the sample are in the
opposite direction from the association in Table \@ref(tab:t-whickham1). This
reversal is known as **Simpson’s paradox**. The term is somewhat
misleading, as the finding is not really paradoxical in any logical
sense. Instead, it is again a consequence of a particular pattern of
associations between the control variable and the other two variables.
----------------------------------------------------------------------------------------------
\ \% Alive after 20 years: \ Number (in 1972--74): \
Age group Smokers Nonsmokers Smokers Nonsmokers
---------------- --------------------------- ------------ ----------------------- ------------
18–24 96.4 98.4 55 62
25–34 97.6 96.8 124 157
35–44 87.2 94.2 109 121
45–54 79.2 84.6 130 78
55–64 55.7 66.9 115 121
65–74 19.4 21.7 36 129
75– 0.0 0.0 12 64
All age groups 76.1 68.6 582 732
----------------------------------------------------------------------------------------------
: (\#tab:t-whickham2)Percentage of respondents in Example 9.4 surviving at the time of
the second survey, by smoking status and age group in 1972–74.
The two-way tables of age by survival and age by smoking are shown side
by side in Table \@ref(tab:t-whickham3). The table is somewhat elaborate and
unconventional, so it requires some explanation. The rows of the table
correspond to the age groups, identified by the second column, and the
frequencies of respondents in each age group are given in the last
column. The left-hand column shows the percentages of survivors within
each age group. The right-hand side of the table gives the two-way table
of age group and smoking status. It contains percentages calculated both
within the rows (without parentheses) and columns (in parentheses) of
the table. As an example, consider numbers for the age group 18–24.
There were 117 respondents in this age group at the time of the first
survey. Of them, 47% were then smokers and 53% were nonsmokers, and 97%
were still alive at the time of the second survey. Furthermore, 10% of
all the 582 smokers, 9% of all the 732 nonsmokers and 9% of the 1314
respondents overall were in this age group.
-------------------------------------------------------------------------------------
\ \ Row and column \% \ \ \
% Alive Age group Smokers Nonsmokers Total \% Count
------------ ------------- ----------------------- --------------- ---------- -------
97 18–24 47 53 100 117
(10) (9) (9)
97 25–34 44 56 100 281
(21) (21) (21)
91 35–44 47 53 100 230
(19) (17) (18)
81 45–54 63 38 100 208
(22) (11) (16)
61 55–64 49 51 100 236
(20) (17) (13)
21 65–74 22 78 100 165
(6) (18) (13)
0 75– 17 83 100 77
(2) (9) (6)
72 Total % 44 56 100
(100) (100) (100)
945 Total count 582 732 1314
-------------------------------------------------------------------------------------
: (\#tab:t-whickham3)Two-way contingency tables of age group vs. survival (on the left)
and age group vs. smoking (on the right) in Example 6.4. The
percentages in parentheses are column percentages (within smokers or
nonsmokers) and the ones without parentheses are row percentages
(within age groups).
Table \@ref(tab:t-whickham3) shows a clear association between age and
survival, for understandable reasons mostly unconnected with smoking.
The youngest respondents of the first survey were highly likely and the
oldest unlikely to be alive twenty years later. There is also an
association between age and smoking: in particular, the proportion of
smokers was lowest among the oldest respondents. The implications of
this are seen perhaps more clearly by considering the column
proportions, i.e. the age distributions of smokers and nonsmokers in the
original sample. These show that the group of nonsmokers was
substantially older than that of the smokers; for example, 27% of the
nonsmokers but only 8% of the smokers belonged to the two oldest age
groups. It is this imbalance which explains why nonsmokers, more of whom
are old, appear to have lower chances of survival until we control for
age.
The discussion above refers to the *sample* associations between smoking
and survival in the partial tables, which suggest that mortality is
higher among smokers than nonsmokers. In terms of statistical
significance, however, the evidence is fairly weak: the association is
even borderline significant only in the 35–44 and 55–64 age groups, with
$P$-values of 0.063 and 0.075 respectively for the $\chi^{2}$ test. This
is an indication not so much of lack of a real association but of the
fact that these data do not provide much power for detecting it. Overall
twenty-year mortality is a fairly rough measure of health status for
comparisons of smokers and nonsmokers, especially in the youngest and
oldest age groups where mortality is either very low or very high for
everyone. Differences are likely to be have been further diluted by many
of the original smokers having stopped smoking between the surveys. This
example should thus not be regarded as a serious examination of the
health effects of smoking, for which much more specific data and more
careful analyses are required.^[For one remarkable example of such studies, see Doll, R. et
al. (2004), “Mortality in relation to smoking: 50 years’
observations on male British doctors”, *British Medical Journal*
**328**, 1519–1528, and Doll, R. and Hill, A. B. (1954), “The
Mortality of doctors in relation to their smoking habits: A
preliminary report”, *British Medical Journal* **228**, 1451–1455.
The older paper is reprinted together with the more recent one in
the 2004 issue of *BMJ*.]
The Berkeley admissions data discussed earlier provide another example
of a partial Simpson’s paradox. Previously we considered only
departments 2–6, for none of which there was a significant partial
association between sex and admission. For department 1, the partial
table indicates a strongly significant difference in favour of women,
with 82% of the 108 female applicants admitted, compared to 62% of the
825 male applicants. However, the two-way association between sex and
admission for departments 1–6 combined remains strongly significant and
shows an even larger difference in favour of men than before. This
result can now be readily explained as a result of imbalances in sex
ratios and rates of admission between departments. Department 1 is both
easy to get into (with 64% admitted) and heavily favoured by men (88% of
the applicants). These features combine to contribute to higher
admissions percentages for men overall, even though within department 1
itself women are more likely to be admitted.
In summary, the examples discussed above demonstrate that many things
can happen to an association between two variables when we control for a
third one. The association may disappear, indicating that it was
spurious, or it may remain similar and unchanged in all of the partial
tables. It may also become different in different partial tables,
indicating an interaction. Which of these occurs depends on the patterns
of associations between the control variable and the other two
variables. The crucial point is that the two-way table alone cannot
reveal which of these cases we are dealing with, because the counts in
the two-way table could split into three-way tables in many different
ways. The only way to determine how controlling for other variables will
affect an association is to actually do so. This is the case not only
for multiway contingency tables, but for all methods of statistical
control, in particular multiple linear regression and other regression
models.
Two final remarks round off our discussion of multiway contingency
tables:
- Extension of the ideas of three-way tables to four-way and larger
contingency tables is obvious and straightforward. In such tables,
every cell corresponds to the subjects with a particular combination
of the values of four or more variables. This involves no new
conceptual difficulties, and the only challenge is how to arrange
the table for convenient presentation. When the main interest is in
associations between a particular pair of two variables, the usual
solution is to present a set of partial two-way tables for them, one
for each combination of the other (control) variables. Suppose, for
instance, that in the university admissions example we had obtained
similar data at two different years, say 1973 and 2003. We would
then have four variables: year, department, sex and
admission status. These could be summarised as in Table
\@ref(tab:t-berkeley2), except that each partial table for sex
vs. admission would now be conditional on the values of both year
and department. The full four-way table would then consist of ten
$2\times 2$ partial tables, one for each of the ten combinations of
two years and five departments, (i.e. applicants to Department 2 in
1973, Department 2 in 2003, and so on to Department 6 in 2003).
- The only inferential tool for multiway contingency tables discussed
here was to arrange the table as a set of two-way partial tables and
to apply the $\chi^{2}$ test of independence to each of them. This
is a perfectly sensible approach and a great improvement over just
analysing two-way tables. There are, however, questions which cannot
easily be answered with this method. For example, when can we say
that associations in different partial tables are different enough
for us to declare that there is evidence of an interaction? Or what
if we want to consider many different partial associations, either
for a response variable with each of the other variables in turn, or
because there is no single response variable? More powerful methods
are required for such analyses. They are multiple regression models
like the multiple linear regression of Section
\@ref(s-regression-multiple), but modified so that they become
approriate for categorical response variables. Some of these models
are introduced on the course MY452.