# Descriptive statistics {#c-descr1}
## Introduction {#s-descr1-intro}
This chapter introduces some common descriptive statistical methods. It
is organised around two dichotomies:
- Methods that are used only for variables with small numbers of
values, vs. methods that are used also or only for variables with
many values (see the end of Section \@ref(ss-intro-def-vartypes) for more on
this distinction). The former include, in particular, descriptive
methods for categorical variables, and the latter the methods for
continuous variables.
- **Univariate** descriptive methods which consider only one variable
at a time, vs. **bivariate** methods which aim to describe the
association between *two* variables.
Section \@ref(s-descr1-1cat) describes univariate methods for categorical
variables and Section \@ref(s-descr1-2cat) bivariate methods for cases
where both variables are categorical. Sections \@ref(s-descr1-1cont) and
\@ref(s-descr1-nums) cover univariate methods which are mostly used for
continuous variables. Section \@ref(s-descr1-2cont) lists some bivariate
methods where at least one variable is continuous; these methods are
discussed in detail elsewhere in the coursepack. The chapter concludes
with some general guidelines for presentation of descriptive tables and
graphs in Section \@ref(s-descr1-presentation).
## Example data sets {#s-descr1-examples}
Two examples are used to illustrate the methods throughout this chapter:
*Example: Country data* \label{country_example}
Consider data for 155 countries on three variables:
- The **region** where the country is located, coded as 1=Africa,
2=Asia, 3=Europe, 4=Latin America, 5=Northern America, 6=Oceania.
- A measure of the level of **democracy** in the country, measured on
an 11-point scale from 0 (lowest level of democracy) to
10 (highest).
- Gross Domestic Product (**GDP**) per capita, in thousands
of U.S. dollars.
Further information on the variables is given in the appendix to this
chapter (Section \@ref(s-descr1-app)), together with the whole data set,
shown in Table \@ref(tab:t-countrydata).
Region is clearly a discrete (and categorical), nominal-level variable,
and GDP a continuous, interval-level variable. The democracy index is
discrete; it is most realistic to consider its measurement level to be
ordinal, and it is regarded as such in this chapter. However, it is the
kind of variable which might in many analyses be treated instead as an
effectively continuous, interval-level variable.
*Example: Survey data on attitudes towards income redistribution*
The data for the second example come from Round 5 of the European Social
Survey (ESS), which was carried out in 2010.^[ESS Round 5: European Social Survey Round 5 Data (2010). Data file
edition 2.0. Norwegian Social Science Data Services, Norway - Data
Archive and distributor of ESS data.] The survey was fielded
in 28 countries, but here we use only data from 2344 respondents in the
UK. Two variables are considered:
- **Sex** of the respondent, coded as 1=Male, 2=Female.
- Answer to the following survey question:\
*“The government should take measures to reduce differences in
income levels”*,\
with five response options coded as “Agree strongly”=1, “Agree”=2,
“Neither agree nor disagree”=3, “Disagree”=4, and “Disagree
strongly”=5. This is a measure of the respondent’s **attitude**
towards income redistribution.
Both of these are discrete, categorical variables. Sex is binary and
attitude is ordinal.
Attitudes towards *income redistribution* are an example of the broader
topic of public opinion on welfare state policies. This is a large topic
of classic and current interest in the social sciences, and questions on
it have been included in many public opinion surveys.^[For recent findings, see for example Svallfors, S. (ed.) (2012),
*Contested Welfare States: Welfare Attitudes in Europe and Beyond*.
Stanford University Press.] Of key
interest is to explore how people’s attitudes are associated with
their individual characteristics (including such factors as age, sex,
education and income) and the contexts in which they live (for example
the type of welfare regime adopted in their country). In section
\@ref(s-descr1-2cat) below we use descriptive statistics to examine such
associations between sex and attitude in this sample.
## Single categorical variable {#s-descr1-1cat}
### Describing the sample distribution {#ss-descr1-1cat-distr}
The term *distribution* is very important in statistics. In this section
we consider the distribution of a single variable in the observed data,
i.e. its *sample distribution*:
- The **sample distribution** of a variable consists of a list of the
values of the variable which occur in a sample, together with the
number of times each value occurs.
Later we will discuss other kinds of distributions, such as population,
probability and sampling distributions, but they will all be variants of
the same concept.
The task of descriptive statistics for a single variable is to summarize
the sample distribution or some features of it. This can be done in the
form of tables, graphs or single numbers.
### Tabular methods: Tables of frequencies {#ss-descr1-1cat-tables}
When a variable has only a limited number of distinct values, its sample
distribution can be summarized directly from the definition given above.
In other words, we simply count and display the number of times each of
the values appears in the data. One way to do the display is as a table,
like the ones for region and the democracy index in the country data,
and attitude in the survey example, which are shown in Tables
\@ref(tab:t-region), \@ref(tab:t-democ) and \@ref(tab:t-attitude) respectively.
Region Frequency Proportion %
------------------ ----------- ------------ -------
Africa 48 0.310 31.0
Asia 44 0.284 28.4
Europe 34 0.219 21.9
Latin America 23 0.148 14.8
Northern America 2 0.013 1.3
Oceania 4 0.026 2.6
Total 155 1.000 100.0
: (\#tab:t-region)Frequency distribution of the region variable in the country data.
--------------------------------------------------------
Democracy \ \ \ Cumulative
score Frequency Proportion % %
----------- ----------- ------------ ------ ------------
0 35 0.226 22.6 22.6
1 12 0.077 7.7 30.3
2 4 0.026 2.6 32.9
3 6 0.039 3.9 36.8
4 5 0.032 3.2 40.0
5 5 0.032 3.2 43.2
6 12 0.077 7.7 50.9
7 13 0.084 8.4 59.3
8 16 0.103 10.3 69.6
9 15 0.097 9.7 79.3
10 32 0.206 20.6 99.9
Total 155 0.999 99.9
--------------------------------------------------------
: (\#tab:t-democ)Frequency distribution of the democracy index in the country
data.
Response Frequency Proportion % Cumulative %
-------------------------------- ----------- ------------ ------- --------------
Agree strongly (1) 366 0.156 15.6 15.6
Agree (2) 1090 0.465 46.5 62.1
Neither agree nor disagree (3) 426 0.182 18.2 80.3
Disagree (4) 387 0.165 16.5 96.8
Disagree strongly (5) 75 0.032 3.2 100.0
Total 2344 1.00 100.0
: (\#tab:t-attitude)Frequency distribution of responses to a question on attitude
towards income redistribution in the survey example.
Each row of such a table corresponds to one possible value of a
variable, and the second column shows the number of units with that
value in the data. Thus there are 48 countries from Africa and 44 from
Asia in the country data set and 32 countries with the highest democracy
score 10, and so on. Similarly, 366 respondents in the survey sample
strongly agreed with the attitude question, and 75 strongly disagreed
with it. These counts are also called **frequencies**, a distribution
like this is a **frequency distribution**, and the table is also known
as a **frequency table**. The sum of the frequencies, given on the line
labelled “Total” in the tables, is the sample size $n$, here 155 for the
country data and 2344 for the survey data.
It is sometimes more convenient to consider relative values of the
frequencies instead of the frequencies themselves. The **relative
frequency** or **proportion** of a category of a variable is its
frequency divided by the sample size. For example, the proportion of
countries from Africa in the country data is $48/155=0.310$ (rounded to
three decimal places). A close relative of the proportion is the
**percentage**, which is simply proportion multiplied by a hundred; for
example, 31% of the countries in the sample are from Africa. The sum of
the proportions is one, and the sum of the percentages is one hundred
(because of rounding error, the sum in a reported table may be very
slightly different, as it is in Table \@ref(tab:t-democ)).
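These calculations can be sketched in a few lines of code. The snippet below is a plain Python illustration (not part of the coursepack); the labels and frequencies are those of the region table above:

```python
# Frequencies of the region variable in the country data (n = 155)
frequencies = {
    "Africa": 48, "Asia": 44, "Europe": 34,
    "Latin America": 23, "Northern America": 2, "Oceania": 4,
}

n = sum(frequencies.values())  # the sample size

# Proportion = frequency / n; percentage = proportion * 100
proportions = {k: round(f / n, 3) for k, f in frequencies.items()}
percentages = {k: round(100 * f / n, 1) for k, f in frequencies.items()}

print(n)                        # 155
print(proportions["Africa"])    # 0.31
print(percentages["Africa"])    # 31.0
```

Because the proportions are rounded to three decimal places, their reported sum can differ very slightly from 1, exactly as noted above for Table \@ref(tab:t-democ).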
### Graphical methods: Bar charts {#ss-descr1-1cat-charts}
Graphical methods of describing data (*statistical graphics*) make use
of our ability to process and interpret even very large amounts of
visual information. The basic graph for summarising the sample
distribution of a discrete variable is a **bar chart**. It is the
graphical equivalent of a one-way table of frequencies.
Figures \@ref(fig:f-bars-region), \@ref(fig:f-bars-democ) and
\@ref(fig:f-bars-attitude) show the bar charts for region, democracy index
and attitude, corresponding to the frequencies in Tables \@ref(tab:t-region),
\@ref(tab:t-democ) and \@ref(tab:t-attitude). Each bar corresponds to one category
of the variable, and the height of the bar is proportional to the
frequency of observations in that category. This visual cue allows us to
make quick comparisons between the frequencies of different categories
by comparing the heights of the bars.
![(\#fig:f-bars-region)Bar chart of regions in the country data.](regions){height="9.5cm"}
![(\#fig:f-bars-democ)Bar chart of the democracy index in the country data.](democ){height="9.5cm"}
![(\#fig:f-bars-attitude)Bar chart of the attitude variable in the survey data example. Agreement with statement: ``The government should take measures to reduce differences in income levels''. European Social Survey, Round 5 (2010), UK respondents only.](bar_attitude){height="8cm"}
Some guidelines for drawing bar charts are:
- The heights of the bars may represent frequencies, proportions
or percentages. This only changes the units on the vertical axis but
not the relative heights of the bars. The shape of the graph will be
the same in each case. In Figure \@ref(fig:f-bars-region), the units are
frequencies, while in Figures \@ref(fig:f-bars-democ) and
\@ref(fig:f-bars-attitude) they are percentages.
- The bars do not touch each other, to highlight the discrete nature
of the variable.
- The bars *must* start at zero. If they do not, visual comparisons
between their heights are distorted and the graph becomes useless.
- If the variable is ordinal, the bars must be in the natural order of
the categories, as in Figures \@ref(fig:f-bars-democ) and
\@ref(fig:f-bars-attitude). If the variable is nominal, the order
is arbitrary. Often it makes sense to order the categories from
largest (i.e. the one with the largest frequency) to the smallest,
possibly leaving any “Others” category last. In Figure
\@ref(fig:f-bars-region), the frequency ordering would swap Northern
America and Oceania, but it seems more natural to keep Northern and
Latin America next to each other.
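As a rough text-mode sketch of these guidelines (plain Python, for illustration only; in practice a chart would be drawn with graphical software), the bars below start from a zero baseline and keep the democracy scores in their natural ordinal order:

```python
# Percentages for the democracy index, as in the frequency table above
percent = {0: 22.6, 1: 7.7, 2: 2.6, 3: 3.9, 4: 3.2, 5: 3.2,
           6: 7.7, 7: 8.4, 8: 10.3, 9: 9.7, 10: 20.6}

# One '#' per (rounded) percentage point; every bar starts at zero,
# and the ordinal categories are kept in their natural order 0..10
for score in sorted(percent):
    bar = "#" * round(percent[score])
    print(f"{score:>2} | {bar} {percent[score]}%")
```

Even in this crude form, the shape of the distribution is visible: the extreme scores 0 and 10 produce by far the longest bars.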
A bar chart is a relatively unexciting statistical graphic in that it
does not convey very much visual information. For nominal variables, in
particular, the corresponding table is often just as easy to understand
and takes less space. For ordinal variables, the bar chart has the
additional advantage that its shape shows how the frequencies vary
across the ordering of the categories. For example, Figure
\@ref(fig:f-bars-democ) quite effectively conveys the information that the
most common values of the democracy index are the extreme scores 0 and
10.
Sometimes you may see graphs which look like bar charts of this kind,
but which actually show the values of a single variable for some units
rather than frequencies or percentages. For example, a report on the
economies of East Asia might show a chart of GDP per capita for Japan,
China, South Korea and North Korea, with one bar for each country, and
their heights proportional to 28.2, 5.0, 17.8 and 1.3 respectively
(cf. the data in Table \@ref(tab:t-countrydata)). The basic idea of such
graphs is the same as that of standard bar charts. However, they are not
particularly useful as descriptive statistics, since they simply display
values in the original data without any summarization or simplification.
### Simple descriptive statistics {#ss-descr1-1cat-descriptives}
Instead of the whole sample distribution, we may want to summarise only
some individual aspects of it, such as its central tendency or
variation. Descriptive statistics that are used for this purpose are
broadly similar for both discrete and continuous variables, so they will
be discussed together for both in Section \@ref(s-descr1-nums).
## Two categorical variables {#s-descr1-2cat}
### Two-way contingency tables {#ss-descr1-2cat-tables}
The next task we consider is how to describe the sample distributions of
two categorical variables together, and in so doing also summarise the
association between these variables. The key tool is a table which shows
the **crosstabulation** of the frequencies of the variables. This is
also known as a **contingency table**. Table \@ref(tab:t-sex-attitude) shows
such a table for the respondents’ sex and attitude in our survey
example. We use it to introduce the basic structure and terminology of
contingency tables:
------------------------------------------------------------------------------
\ Agree \ Neither agree \ Disagree \
Sex strongly Agree nor disagree Disagree strongly Total
-------- --------------- ------- --------------- ---------- ---------- -------
Male 160 439 187 200 41 1027
Female 206 651 239 187 34 1317
Total 366 1090 426 387 75 2344
------------------------------------------------------------------------------
: (\#tab:t-sex-attitude)*``The government should take measures to reduce differences in income levels''*: Two-way table of frequencies of respondents in the survey example,
by sex and attitude towards income redistribution. Data: European Social Survey, Round 5, 2010, UK respondents only.
- Because a table like \@ref(tab:t-sex-attitude) summarizes the values of
two variables, it is known as a **two-way** contingency table.
Similarly, the tables of single variables introduced in Section
\@ref(ss-descr1-1cat-tables) are *one-way* tables. It is also
possible to construct tables involving more than two variables,
i.e. three-way tables, four-way tables, and so on. These are
discussed in Chapter \@ref(c-3waytables).
- The variables in a contingency table may be ordinal or nominal
(including dichotomous). Often an ordinal variable is derived by
grouping an originally continuous, interval-level variable, a
practice which is discussed further in Section \@ref(s-descr1-1cont).
- The horizontal divisions of a table (e.g. the lines corresponding to
the two sexes in Table \@ref(tab:t-sex-attitude)) are its **rows**, and
the vertical divisions (e.g. the survey responses in Table
\@ref(tab:t-sex-attitude)) are its **columns**.
- The size of a contingency table is stated in terms of the numbers of
its rows and columns. For example, Table \@ref(tab:t-sex-attitude) is a
$2\times
5$ (pronounced “two-by-five”) table, because it has two rows and
five columns. This notation may also be used symbolically, so that
we may refer generically to $R\times C$ tables which have
some (unspecified) number of $R$ rows and $C$ columns. The smallest
two-way table is thus a $2\times 2$ table, where both variables
are dichotomous.
- The intersection of a row and a column is a **cell** of the table.
The basic two-way contingency table shows in each cell the
number (frequency) of units in the data set with the corresponding
values of the row variable and the column variable. For example,
Table \@ref(tab:t-sex-attitude) shows that there were 160 male
respondents who strongly agreed with the statement, and 239 female
respondents who neither agreed nor disagreed with it. These
frequencies are also known as **cell counts**.
- The row and column labelled “Total” in Table \@ref(tab:t-sex-attitude)
are known as the **margins** of the table. They show the frequencies
of the values of the row and the column variable separately, summing
the frequencies over the categories of the other variable. For
example, the table shows that there were overall 1027
($=160+439+187+200+41$) male respondents, and that overall 75
($=41+34$) respondents strongly disagreed with the statement. In
other words, the margins are *one-way* tables of the frequencies of
each of the two variables, so for example the frequencies on the
margin for attitude in Table \@ref(tab:t-sex-attitude) are the same as
the ones in the one-way table for this variable shown in Table
\@ref(tab:t-attitude). The distributions described by the margins are
known as the **marginal distributions** of the row and
column variables. In contrast, the frequencies in the internal cells
of the table, which show how many units have each possible
*combination* of the row and column variables, describe the **joint
distribution** of the two variables.
- The number in the bottom right-hand corner of the table is the sum
of all of the frequencies, i.e. the total sample size $n$.
In addition to frequencies, it is often convenient to display
proportions or percentages. Dividing the frequencies by the sample size
gives overall proportions and (multiplying by a hundred) percentages.
This is illustrated in Table \@ref(tab:t-sex-attitude-pr), which shows the
overall proportions, obtained by dividing the frequencies in Table
\@ref(tab:t-sex-attitude) by $n=2344$. For example, out of all these
respondents, the proportion of 0.102 ($=239/2344$) were women who
neither agreed nor disagreed with the statement. The proportions are
also shown for the marginal distributions: for example, 15.6% (i.e. the
proportion $0.156=366/2344$) of the respondents strongly agreed with the
statement. The sum of the proportions over all the cells is 1, as shown
in the bottom right corner of the table.
------------------------------------------------------------------------------------
Agree \ Neither agree \ Disagree \
Sex strongly Agree nor disagree Disagree strongly Total
-------------- --------------- ------- --------------- ---------- ---------- -------
Male 0.068 0.187 0.080 0.085 0.017 0.438
Female 0.088 0.278 0.102 0.080 0.015 0.562
Total 0.156 0.465 0.182 0.165 0.032 1.000
------------------------------------------------------------------------------------
:(\#tab:t-sex-attitude-pr)*``The government should take measures to reduce differences in income levels''*: Two-way table of joint proportions of respondents in the survey
example, with each combination of sex and attitude towards income
redistribution. Data: European Social Survey, Round 5, 2010, UK respondents only.
### Conditional proportions {#ss-descr1-2cat-cond}
A two-way contingency table is symmetric in that it does not distinguish
between explanatory and response variables. In many applications,
however, this distinction is useful for interpretation. In our example,
for instance, it is natural to treat sex as the explanatory variable and
attitude towards income redistribution as the response, and so
to focus the interpretation on how attitude may depend on sex.
The overall proportions are in such cases not the most relevant
quantities for interpretation of a table. Instead, we typically
calculate proportions within each category of the row variable or the
column variable, i.e. the **conditional proportions** of one variable
given the other. The numbers in brackets in Table
\@ref(tab:t-sex-attitude-row) show these proportions calculated for each
*row* of Table \@ref(tab:t-sex-attitude) (Table \@ref(tab:t-sex-attitude-row) also
includes the actual frequencies; it is advisable to include them even
when conditional proportions are of most interest, to show the numbers
on which the proportions are based). In other words, these are the
conditional proportions of attitude towards income redistribution given
sex, i.e. separately for men and women. For example, the number 0.156 in
the top left-hand corner of Table \@ref(tab:t-sex-attitude-row) is obtained
by dividing the number of male respondents who agreed strongly with the
statement (160) by the total number of male respondents (1027). Thus
15.6% of the men strongly agreed, and for example 2.6% of women strongly
disagreed with the statement. The entries (1.0) in the last column of
the table indicate that the proportions sum to 1 along each row, to remind us that
the conditional proportions have been calculated within the rows. The
bracketed proportions in the ‘Total’ row are the proportions of the
*marginal* distribution of the attitude variable, so they are the same
as the proportions in the ‘Total’ row of Table \@ref(tab:t-sex-attitude-pr).
-------------------------------------------------------------------------------------
Agree \ Neither agree \ Disagree \
Sex strongly Agree nor disagree Disagree strongly Total
------------- --------------- --------- --------------- ---------- ---------- -------
Male 160 439 187 200 41 1027
(0.156) (0.428) (0.182) (0.195) (0.040) (1.0)
Female 206 651 239 187 34 1317
(0.156) (0.494) (0.182) (0.142) (0.026) (1.0)
Total 366 1090 426 387 75 2344
(0.156) (0.465) (0.182) (0.165) (0.032) (1.0)
-------------------------------------------------------------------------------------
: (\#tab:t-sex-attitude-row)*``The government should take measures to reduce differences in income levels''*: Two-way table of frequencies of respondents in the survey example,
by sex and attitude towards income redistribution. The numbers in
brackets are proportions within the rows, i.e. conditional proportions
of attitude given sex. Data: European Social Survey, Round 5, 2010, UK respondents only.
We could also have calculated conditional proportions within the
*columns*, i.e. for sex given attitude. For example, of all the
respondents who strongly agreed with the statement, the proportion
$0.563=206/366$ are women. These proportions, however, seem less
interesting, because it
seems more natural to examine how attitude varies by sex rather than how
sex varies by attitude. In general, for any two-way table we can
calculate conditional proportions for both the rows and the columns, but
typically only one of them is used for interpretation.
### Conditional distributions and associations {#ss-descr1-2cat-assoc}
Suppose that we regard one variable in a two-way table as the
explanatory variable (let us denote it by $X$) and the other variable as
the response variable ($Y$). In our survey example, sex is thus $X$ and
attitude is $Y$. Here the dichotomous $X$ divides the full sample into
two groups, identified by the observed value of $X$ — men and women. We
may then think of these two groups as two separate samples, and consider
statistical quantities separately for each of them. In particular, in
Table \@ref(tab:t-sex-attitude-row) we calculated conditional proportions for
$Y$ given $X$, i.e. for attitude given sex. These proportions describe
two distinct sample distributions of $Y$, one for men and one for women.
They are examples of *conditional distributions*:
- The **conditional distribution** of a variable $Y$ given another
variable $X$ is the distribution of $Y$ among those units which have
a particular value of $X$.
This concept is not limited to two-way tables but extends also to other
kinds of variables and distributions that are discussed later in this
coursepack. Both the response variable $Y$ and the explanatory variable
$X$ may be continuous as well as discrete, and can have any number of
values. In all such cases there is a separate conditional distribution
for $Y$ for each possible value of $X$. A particular one of these
distributions is sometimes referred to more explicitly as the
conditional distribution of $Y$ given $X=x$, where the “$X=x$” indicates
that $X$ is considered at a particular value $x$ (as in “the
distribution of $Y$ given $X=2$”, say).
Conditional distributions of one variable given another allow us to
define and describe associations between the variables. The informal
definition in Section \@ref(ss-intro-def-assoc) stated that there is an
association between two variables if knowing the value of one of them
will help to predict the value of the other. We can now give a more
precise definition:
- There is an **association** between variables $X$ and $Y$ if the
conditional distribution of $Y$ given $X$ is different for different
values of $X$.
This definition coincides with the more informal one. If the conditional
distribution of $Y$ varies with $X$ and if we know $X$, it is best to
predict $Y$ from its conditional distribution given the known value of
$X$. This will indeed work better than predicting $Y$ without using
information on $X$, i.e. from the marginal distribution of $Y$.
Prediction based on the conditional distribution would still be subject
to error, because in most cases $X$ does not predict $Y$ perfectly. In
other words, the definition of an association considered here is
*statistical* (or *probabilistic*) rather than *deterministic*. In our
example a deterministic association would mean that there is one
response given by all the men and one response (possibly different from
the men’s) given by all the women. This is of course not the case here
nor in most other applications in the social sciences. It is thus
crucially important that we also have the tools to analyse statistical
associations.
In our example, sex and attitude are associated if men and women differ
in their attitudes toward income redistribution. Previous studies
suggest that such an association exists, and that it takes the form that
women tend to have higher levels of support than men for
redistribution.^[See, for example, Svallfors (1997), Words of welfare and attitudes
to redistribution: A comparison of eight western nations, *European
Sociological Review*, 13, 283-304; and Blekesaune and Quadagno
(2003), Public attitudes towards welfare state policies: A
comparative analysis of 24 nations, *European Sociological Review*,
19, 415-427.] As possible explanations for this pattern, both
structural reasons (women tend to have lower incomes than men and to
rely more on welfare state support) and cultural or psychological ones
(women are more likely than men to adopt social values of equality and
caring) have been suggested.
### Describing an association using conditional proportions {#ss-descr1-2cat-descr}
Two variables presented in a contingency table are associated in the
sample if the conditional distributions of one of them vary across the
values of the other. This is the case in our data set: for example, 4.0%
of men but 2.6% of women strongly disagree with the statement. There is
thus some association between sex and attitude in this sample. This much
is easy to conclude. What requires a little more work is a more detailed
description of the pattern and strength of the association, i.e. how and
where the conditional distributions differ from each other.
The most general way of summarising associations in a contingency table
is by comparing the conditional proportions of the same level of the
response given different levels of the explanatory variable. There is no
simple formula for how this should be done, so you should use your
common sense to present comparisons which give a good summary of the
patterns across the table. Unless both variables in the table are
dichotomous, several different comparisons may be needed, and may not
all display similar patterns. For example, in Table
\@ref(tab:t-sex-attitude-row) the same proportion (0.156, or 15.6%) of both
men and women strongly agree with the statement, whereas the proportion
who respond “Agree” is higher for women (49.4%) than for men (42.8%).
When the response variable is ordinal, it is often more illuminating to
focus on comparisons of *cumulative* proportions which add up
conditional proportions over two or more adjacent categories. For
instance, the combined proportion of respondents who either strongly
agree or agree with the statement is a useful summary of the general
level of agreement among the respondents. In our example this is 58.4%
($=15.6\%+42.8\%$) for men but 65.0% for women.
A comparison between two proportions may be further distilled into a
single number by reporting the *difference* or *ratio* between them. For
example, for the proportions of agreeing or strongly agreeing above, the
difference is $0.650-0.584=0.066$, so the proportion is 0.066 (i.e. 6.6
percentage points) higher for women than for men. The ratio of these
proportions is $0.650/0.584=1.11$, so the proportion for women is 1.11
times the proportion for men (i.e. 11% higher). Both of these indicate
that in this sample women were more likely to agree or strongly agree
with the statement than were men. In a particular application we might
report a difference or a ratio like this, depending on which of them was
considered more relevant or easily understandable. Other summaries are
also possible; for example, on MY452 we will discuss a measure called
the *odds ratio*, which turns out to be convenient for more general
methods of analysing associations involving categorical variables.
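These comparisons are simple enough to check by hand, but the steps can also be sketched in a few lines of code (Python here, purely for illustration; the proportions are those quoted above from Table \@ref(tab:t-sex-attitude-row)):

```python
# Conditional proportions of attitude given sex, as quoted in the text
p_men = {"strongly_agree": 0.156, "agree": 0.428}
p_women = {"strongly_agree": 0.156, "agree": 0.494}

# Cumulative proportion of "Agree or Strongly agree" in each group
agree_men = p_men["strongly_agree"] + p_men["agree"]        # 0.584
agree_women = p_women["strongly_agree"] + p_women["agree"]  # 0.650

# Difference and ratio between the two cumulative proportions
difference = agree_women - agree_men   # 0.066, i.e. 6.6 percentage points
ratio = agree_women / agree_men        # about 1.11

print(round(difference, 3), round(ratio, 2))
```

Whether the difference or the ratio is then reported is a presentational choice; both summarise the same comparison between the two groups.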
The broad conclusion in the example is that there is an association
between sex and attitude in these data from the European Social Survey,
and that it is of the kind suggested by existing literature. A larger
proportion of women than of men indicate agreement with the statement
that the government should take measures to reduce income differences,
and conversely a larger proportion of men disagree with it (e.g. 23.5% of
men but only 16.8% of women disagree or strongly disagree). Thus in this
sample women do indeed demonstrate somewhat higher levels of support for
income redistribution. Whether these differences also warrant a
generalisation of the conclusions to people outside the sample is a
question which we will take up in Chapters \@ref(c-samples) and
\@ref(c-tables).
### A measure of association for ordinal variables {#ss-descr1-2cat-gamma}
In the previous example the explanatory variable (sex) had 2 categories
and the response variable (attitude) had 5. A full examination of the
individual conditional distributions of attitude given sex then involved
comparisons of five pairs of proportions, one for each level of the
attitude variable. This number gets larger still if the explanatory
variable also has several levels, as in the following example:
*Example: Importance of short-term gains for investors*
Information on the behaviour and expectations of individual investors
was collected by sending a questionnaire to a sample of customers of a
U.S. brokerage house.^[Lewellen, W. G., Lease, R. G., and Schlarbaum, G. G. (1977).
“Patterns of investment strategy and behavior among individual
investors”. *The Journal of Business*, **50**, 296–333. The
published article gave only the total sample size, the marginal
distributions of sex and age group, and conditional proportions for
the short-term gains variable given sex and age group. These were
used to create tables of frequencies separately for men and women
(assuming further that the age distribution was the same for both),
and Table \@ref(tab:t-investors) was obtained by combining these. The
resulting table is consistent with information in the article, apart
from rounding error.] One of the questions asked the respondents to
state how much importance they placed on quick profits (short-term
gains) as an objective when they invested money. The responses were
recorded in four categories as “Irrelevant”, “Slightly important”,
“Important” or “Very important”. Table \@ref(tab:t-investors) shows the
crosstabulation of this variable with the age of the respondent in four
age groups.
-------------------------------------------------------------------------
\ \ Slightly \ Very \
Age group Irrelevant important Important important Total
------------- -------------- ----------- ----------- ----------- --------
Under 45 37 45 38 26 146
(0.253) (0.308) (0.260) (0.178) (1.00)
45–54 111 77 57 37 282
(0.394) (0.273) (0.202) (0.131) (1.00)
55–64 153 49 31 20 253
(0.605) (0.194) (0.123) (0.079) (1.00)
65 and over 193 64 19 15 291
(0.663) (0.220) (0.065) (0.052) (1.00)
Total 494 235 145 98 972
-------------------------------------------------------------------------
: (\#tab:t-investors)Importance of short-term gains: Frequencies of respondents in the investment example, by age group
and attitude towards short-term gains as investment goal. Conditional
proportions of attitude given age group are shown in brackets. The
value of the $\gamma$ measure of association is $-0.377$.
Here there are four conditional distributions, one for each age group,
and each of them is described by four proportions of different levels of
attitude. There are then many possible comparisons of the kind discussed
above. For example, we might want to compare the proportions of
respondents who consider short-term gains irrelevant between the oldest
and the youngest age group, the proportions for whom such gains are very
important between these two groups, or, in general, the proportions in
any category of the response variable between any two age groups.
Although pairwise comparisons like this are important and informative,
they can clearly become cumbersome when the number of possible
comparisons is large. A potentially attractive alternative is then to
try to summarise the strength of the association between the variables
in a single number, a **measure of association** of some kind. There are
many such measures for two-way contingency tables, labelled with a range
of Greek and Roman letters (e.g. $\phi$, $\lambda$, $\gamma$, $\rho$,
$\tau$, V, Q, U and d). The most useful of them are designed for tables
where both of the variables are measured at the ordinal level, as is the
case in Table \@ref(tab:t-investors). The ordering of the categories can then
be exploited to capture the strength of the association in a single
measure. This is not possible when at least one of the variables is
measured at the nominal level, as any attempt to reduce the patterns of
the conditional proportions into one number will then inevitably
obscure much of the information in the table. It is better to avoid
measures of association defined for nominal variables, and to describe
their associations only through comparisons of conditional proportions
as described in the previous section.
Here we will discuss only one measure of association for two-way tables
of ordinal variables. It is known as $\gamma$ (“gamma”). It
characterises one possible general pattern of association between two
ordinal variables, namely the extent to which high values of one
variable tend to be associated with high or low values of the other
variable. Here speaking of “low” and “high” values, or of “increasing”
or “decreasing” them, is meaningful when the variables are ordinal. For
example, in Table \@ref(tab:t-investors) the categories corresponding to the
bottom rows and right-most columns are in an obvious sense “high” values
of age and importance respectively.
Consider the conditional proportions of importance given age group shown
in Table \@ref(tab:t-investors). It is clear that, for example, the
proportion of respondents for whom short-term gains are very important
is highest in the youngest, and lowest in the oldest age group.
Similarly, the proportion of respondents for whom such gains are
irrelevant increases consistently from the youngest to the oldest group.
In other words, respondents with *high* values of the explanatory
variable (age group) tend to have *low* values of the response variable
(importance of short-term gains). Such an association is said to be
*negative*. A *positive* association would be seen in a table where high
values of one variable were associated with high values of the other.
Measures of association for summarising such patterns are typically
based on the numbers of concordant and discordant pairs of observations.
Suppose we compare two units classified according to the two variables
in the table. These units form a *concordant pair* if one of them has a
higher value of both variables than the other. For example, consider two
respondents in Table \@ref(tab:t-investors), one with values (Under 45;
Irrelevant) and the other with (45–54; Important). This is a concordant
pair, because the second respondent has both a higher value of age group
(45–54 vs. Under 45) and a higher value of the importance variable
(Important vs. Irrelevant) than the first respondent. In contrast, in a
*discordant pair* one unit has a higher value of one variable but a
lower value of the other variable than the other unit. For example, a
pair of respondents with values (45–54; Very important) and (55–64;
Irrelevant) is discordant, because the latter has a higher value of age
group but a lower value of the importance variable than the former.
Pairs of units with the same value of one or both of the variables are
known as *tied* pairs. They are not used in the calculations discussed
below.
The $\gamma$ measure of association is defined as
\begin{equation}
\gamma=\frac{C-D}{C+D}
(\#eq:gamma)
\end{equation} where $C$ is the total number of concordant pairs in the
table, and $D$ is the number of discordant pairs. For Table
\@ref(tab:t-investors), the value of this is $\gamma=-0.377$.
Calculation of $C$ and $D$ is straightforward but tedious and
uninteresting, and can be left to a computer. Remembering the exact form
of (\@ref(eq:gamma)) is also not crucial. More important than the formula of
$\gamma$ (or any other measure of association) is its interpretation.
This can be considered on several levels of specificity, which are
discussed separately below. The discussion is relatively detailed, as
these considerations are relevant and useful not only for $\gamma$, but
also for all other measures of association in statistics.
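As noted, the counting of $C$ and $D$ can be left to a computer. The sketch below (Python, purely for illustration) counts the concordant and discordant pairs in Table \@ref(tab:t-investors) cell by cell and then evaluates (\@ref(eq:gamma)):

```python
# Frequencies from the table of age group by importance of short-term gains,
# rows ordered from youngest to oldest, columns from Irrelevant to Very important.
table = [
    [37, 45, 38, 26],    # Under 45
    [111, 77, 57, 37],   # 45-54
    [153, 49, 31, 20],   # 55-64
    [193, 64, 19, 15],   # 65 and over
]

def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in a two-way table."""
    C = D = 0
    nrow, ncol = len(table), len(table[0])
    for i in range(nrow):
        for j in range(ncol):
            for k in range(i + 1, nrow):   # cells in rows strictly below (higher age)
                for l in range(ncol):
                    if l > j:              # higher on both variables: concordant
                        C += table[i][j] * table[k][l]
                    elif l < j:            # higher on one, lower on the other: discordant
                        D += table[i][j] * table[k][l]
    return C, D

C, D = concordant_discordant(table)
gamma = (C - D) / (C + D)
print(C, D, round(gamma, 3))   # 72087 159337 -0.377
```

Tied pairs (same row or same column) fall into neither branch of the `if`, so they are ignored, as the definition requires.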
The **sign** of the statistic: It can be seen from (\@ref(eq:gamma)) that
$\gamma$ is positive (greater than zero) when there are more concordant
pairs than discordant ones (i.e. $C>D$), and negative when there are
more discordant than concordant pairs ($C<D$). This also implies that
$\gamma$ will be positive when the association is positive in the sense
discussed above, and negative when the association is negative. A value
of $\gamma=0$ indicates a complete lack of association of this kind. In
Table \@ref(tab:t-investors) we have $\gamma=-0.377$, indicating a negative
association. This agrees with the conclusion obtained informally above.
The **extreme values** of the statistic: Clearly $\gamma=1$ if there are
no discordant pairs ($D=0$), and $\gamma=-1$ if there are no concordant
pairs ($C=0$). The values $\gamma=-1$ and $\gamma=1$ are the smallest
and largest possible values of $\gamma$, and indicate the strongest
possible levels of negative and positive association respectively. More
generally, the closer $\gamma$ is to $-1$ or 1, the stronger is the
(negative or positive) association.
The **formal interpretation** of the statistic: This refers to any way
of interpreting the value more understandably than just vaguely as a
measure of “strength of association”. Most often, such an interpretation
is expressed as a *proportion* of some kind. For $\gamma$, this is done
using a principle known as **Proportional reduction of error** (PRE).
Because the PRE idea is also used to interpret many other measures of
association in statistics, we will first describe it in general terms
which are not limited to $\gamma$.
Suppose we consider an explanatory variable $X$ and a response variable
$Y$, and want to make predictions of the values of $Y$ in a data set.
This is done twice, first in a way which makes no use of $X$, and then
in a way which predicts the value of $Y$ for each unit using information
on the corresponding value of $X$ and on the strength and direction of
the association between $X$ and $Y$. Recalling the connection between
association and prediction, it is clear that the second approach should
result in better predictions if the two variables are associated. The
comparison also reflects the *strength* of the association: the stronger
it is, the bigger is the improvement in prediction gained by utilising
information on $X$.
A PRE measure describes the size of this improvement. Suppose that the
magnitude or number of errors made in predicting the values of $Y$ in a
data set using the first scheme, i.e. ignoring information on $X$, is
somehow measured by a single number $E_{1}$, and that $E_{2}$ is the
same measure of errors for the second prediction scheme which makes use
of $X$. The difference $E_{1}-E_{2}$ is thus the improvement in
prediction achieved by the second scheme over the first. A PRE measure
of association is the ratio
\begin{equation}
\text{PRE}= \frac{E_{1}-E_{2}}{E_{1}},
(\#eq:PRE)
\end{equation} i.e. the improvement in predictions as a *proportion* of
the number of errors $E_{1}$ under the first scheme. This formulation is
convenient for interpretation, because a proportion is easily
understandable even if $E_{1}$ and $E_{2}$ themselves are expressed in
some unfamiliar units. The smallest possible value of (\@ref(eq:PRE)) is
clearly 0, obtained when $E_{2}=E_{1}$, i.e. when using information on
$X$ gives no improvement in predictions. The largest possible value of
PRE is 1, obtained when $E_{2}=0$, i.e. when $Y$ can be predicted
perfectly from $X$. The values 0 and 1 indicate no association and
perfect association respectively.
The $\gamma$ statistic is a PRE measure, although with a somewhat
convoluted explanation. Suppose that we consider a pair of observations
which is known to be either concordant or discordant (the PRE
interpretation of $\gamma$ ignores tied observations). One of the two
observations thus has a higher value of $X$ than the other. For example,
suppose that we consider two respondents in Table \@ref(tab:t-investors) from
different age groups. We are then asked to predict the *order* of the
values of $Y$, i.e. which of the two units has the higher value of $Y$.
In the example of Table \@ref(tab:t-investors), this means predicting whether
the older respondent places a higher or lower level of importance on
short-term gains than the younger respondent. Two sets of predictions
are again compared. The first approach makes the prediction at random
and with equal probabilities, essentially tossing a coin to guess
whether the observation with the higher value of $X$ has the higher or
lower value of $Y$. The second prediction makes use of information on
the direction of the association between $X$ and $Y$. If the association
is known to be negative (i.e. there are more discordant than concordant
pairs), every pair is predicted to be discordant; if it is positive,
every pair is predicted to be concordant. For example, in Table
\@ref(tab:t-investors) the association is negative, so we would always
predict that the older of two respondents places a lower value of
importance on short-term gains.
If these predictions are repeated for every non-tied pair in the table,
the expected number of incorrect predictions under the first scheme is
$E_{1}=(C+D)/2$. Under the second scheme it is $E_{2}=D$ if the
association is positive and $E_{2}=C$ if it is negative. Substituting
these into the general formula (\@ref(eq:PRE)) shows that the $\gamma$
statistic (\@ref(eq:gamma)) is of the PRE form when $\gamma$ is positive;
when it is negative, the absolute value of $\gamma$ (i.e. its value with
the minus sign omitted) is a PRE measure, and the negative sign of
$\gamma$ indicates that the association is in the negative direction. In
our example $\gamma=-0.377$, so age and attitude are negatively
associated. Its absolute value $0.377$ shows that we will make 37.7%
fewer errors if we predict for every non-tied pair that the older
respondent places less importance on short-term gains, compared to
predictions made by tossing a coin for each pair.
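This PRE arithmetic can be verified directly from the pair counts of Table \@ref(tab:t-investors). In the sketch below (Python, for illustration) $C$ and $D$ are recounted from the frequencies and then substituted into (\@ref(eq:PRE)):

```python
# Frequencies from the table of age group (youngest first) by importance
# of short-term gains (Irrelevant to Very important).
table = [[37, 45, 38, 26], [111, 77, 57, 37], [153, 49, 31, 20], [193, 64, 19, 15]]

C = D = 0
for i, row in enumerate(table):
    for j, n in enumerate(row):
        for k in range(i + 1, len(table)):
            for l, m in enumerate(table[k]):
                if l > j:
                    C += n * m    # concordant pair
                elif l < j:
                    D += n * m    # discordant pair

E1 = (C + D) / 2   # expected errors when guessing the order at random
E2 = min(C, D)     # errors when always predicting the majority direction
                   # (here the association is negative, so E2 = C)
PRE = (E1 - E2) / E1
print(round(PRE, 3))   # 0.377, the absolute value of gamma
```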
The final property of interest is the **substantive interpretation** of
the strength of association indicated by $\gamma$ for a particular
table. For example, should $\gamma=-0.377$ for Table \@ref(tab:t-investors)
be regarded as evidence of weak, moderate or strong negative association
between age and attitude? Although this is usually the most (or only)
interesting part of the interpretation, it is also the most difficult,
and one to which a statistician’s response is likely to be a firm “it
depends”. This is because the strength of associations we may expect to
observe depends on the variables under consideration: a $\gamma$ of 0.5,
say, might be commonplace for some types of variables but never observed
for others. Considerations of the magnitude of $\gamma$ are most useful
in comparisons of associations between the same two variables in
different samples or groups. For example, in Chapter \@ref(c-3waytables)
we will calculate $\gamma$ for the variables in Table \@ref(tab:t-investors)
separately for men and women (see Table \@ref(tab:t-investors3)). These turn
out to be very similar, so the strength of the association appears to be
roughly similar in these two groups.
Three further observations complete our discussion of $\gamma$:
- Since “high” values of a variable were defined as ones towards the
bottom and right of a table, reversing the order in which the
categories are listed will also reverse the interpretation of “high”
and “low” and of a “negative” or “positive” association. Such a
reversal for one variable will change the sign of $\gamma$ but not
its absolute value. For example, in Table \@ref(tab:t-investors) we could
have listed the age groups from the oldest to the youngest, in which
case we would have obtained $\gamma=0.377$ instead of
$\gamma=-0.377$. Reversing the ordering of both of the variables
will give the same value of $\gamma$ as when neither is reversed.
The nature and interpretation of the association remain unchanged in
each case.
- $\gamma$ can also be used when one or both of the variables are
dichotomous, but not when either is nominal and has more than
two categories. If, for example, the table includes a nominal
variable with four categories, there are 24 different and equally
acceptable ways of ordering the categories, each giving a different
value of $\gamma$ (or rather 12 different positive values and their
negatives). An interpretation of the value obtained for any
particular ordering is then entirely meaningless.
- $\gamma$ can also be treated as an estimate of the corresponding
measure of association in a population from which the observed table
is a sample. To emphasise this, the symbol $\hat{\gamma}$ is
sometimes used for the sample statistic we have discussed here,
reserving $\gamma$ for the population parameter. It is then also
possible to define significance tests and confidence intervals for
the population $\gamma$. These are given, for example, in SPSS
output for two-way tables. Here, however, we will not discuss them,
but will treat $\gamma$ purely as a descriptive measure
of association. Statistical inference on associations for two-way
tables will be considered only in the context of a different test,
introduced in Chapter \@ref(c-tables).
## Sample distributions of a single continuous variable {#s-descr1-1cont}
### Tabular methods {#ss-descr1-1cont-tab}
A table of frequencies and proportions or percentages is a concise and
easily understandable summary of the sample distribution of a
categorical variable or any variable for which only a small number of
different values have been observed. On the other hand, applying the
same idea to a continuous variable or a discrete variable with many
different values is likely to be less useful, because all of the
individual frequencies may be small. For example, in this section we
illustrate the methods using the GDP variable in the country data
introduced at the beginning of Section \@ref(s-descr1-examples). This has 99 different
values among the 155 countries, 66 of these values appear only once, and
the largest frequency (for 0.8) is five. A frequency table of these
values would be entirely unenlightening.
----------------------------------
GDP \ \
(thousands of
dollars) Frequency %
--------------- ----------- ------
less than 2.0 49 31.6
2.0–4.9 32 20.6
5.0–9.9 29 18.7
10.0–19.9 21 13.5
20.0–29.9 19 12.3
30.0 or more 5 3.2
Total 155 99.9
----------------------------------
: (\#tab:t-gdp)Frequency distribution of GDP per capita in the country data.
Instead, we can count the frequencies for some *intervals* of values.
Table \@ref(tab:t-gdp) shows an example of this for the GDP variable. The
frequency on its first line shows that there are 49 countries with GDP
per capita of less than \$2000, the second line that there are 32
countries with the GDP per capita between \$2000 and \$4900 (these
values included), and so on. We have thus in effect first created an
ordinal categorical variable by grouping the original continuous GDP
variable, and then drawn a frequency table of the grouped variable in
the same way as we do for categorical variables. Some information about
the distribution of the original, ungrouped variable will be lost in
doing so, in that the exact values of the observations within each
interval are obscured. This, however, is a minor loss compared to the
benefit of obtaining a useful summary of the main features of the
distribution.
The intervals must be *mutually exclusive*, so that no value belongs to
more than one interval, and *exhaustive*, so that all values in the data
belong to some interval. Otherwise the choice is arbitrary, in that we
can choose the intervals in any way which is sensible and informative.
Often this is a question of finding the right balance between too few
categories (losing too much of the original information) and too many
categories (making the table harder to read).
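The grouping step itself is easy to automate. The sketch below (Python, for illustration; the data values are hypothetical, not the actual GDP figures) assigns each observation to one of the intervals of Table \@ref(tab:t-gdp) and tabulates the frequencies and percentages:

```python
from bisect import bisect_right

# Hypothetical GDP-per-capita values (thousands of dollars), for illustration only
values = [0.8, 1.5, 2.0, 3.7, 4.9, 5.0, 8.2, 12.4, 19.9, 25.0, 31.6]

# Interval lower bounds as in the frequency table: <2.0, 2.0-4.9, 5.0-9.9, ...
edges = [2.0, 5.0, 10.0, 20.0, 30.0]
labels = ["less than 2.0", "2.0-4.9", "5.0-9.9", "10.0-19.9", "20.0-29.9", "30.0 or more"]

counts = dict.fromkeys(labels, 0)
for v in values:
    # bisect_right finds which interval v falls into (intervals are exhaustive
    # and mutually exclusive, so every value lands in exactly one of them)
    counts[labels[bisect_right(edges, v)]] += 1

n = len(values)
for label in labels:
    print(f"{label:15s} {counts[label]:3d} {100 * counts[label] / n:5.1f}")
```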
### Graphical methods {#ss-descr1-1cont-graphs}
#### Histograms {-}
![(\#fig:f-hist-gdp)Histogram of GDP per capita in the country data, together with the
corresponding frequency polygon.](gdp){width="13.5cm"}
A **histogram** is the graphical version of a frequency table for a
grouped variable, like that in Table \@ref(tab:t-gdp). Figure
\@ref(fig:f-hist-gdp) shows a histogram for the GDP variable (the histogram
consists of the bars; the lines belong to a different graph, the
frequency polygon explained below). The basic idea of a histogram is
very similar to that of the bar chart, except that now the bars touch
each other to emphasise the fact that the original (ungrouped) variable
is considered continuous. Because the grouped variable is ordinal, the
bars of a histogram must be in the correct order.
A good choice of the grouping intervals of the variable and thus the
number of bars in the histogram is important for the usefulness of the
graph. If there are too few bars, too much information is obscured; if
too many, the shape of the histogram may become confusingly irregular.
Often the number of intervals used for a histogram will be larger than
what would be sensible for a table like Table \@ref(tab:t-gdp). Furthermore,
intervals like those in Table \@ref(tab:t-gdp) are not even allowed in a
histogram, because they are of different widths (of 2, 3, 5, 10 and 10
units for the first five, and unbounded for the last one). The intervals
in a histogram must be of equal widths, because otherwise the visual
information in it becomes distorted (at least unless the histogram is
modified in ways not discussed here). For example, the intervals in
Figure \@ref(fig:f-hist-gdp) (less than 2.5, 2.5–less than 5.0, 5.0–less than
7.5 etc.) are all 2.5 units wide. The exact choice can usually be left
to computer packages such as SPSS which use automatic rules for choosing
sensible intervals.
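The equal-width requirement amounts to choosing a single bin width and cutting the range of the data into steps of that width. A minimal sketch (Python; the values are hypothetical, and the width of 2.5 matches the intervals used in Figure \@ref(fig:f-hist-gdp)):

```python
# Equal-width binning as used by a histogram: every interval is `width` units wide.
values = [0.8, 1.5, 3.7, 4.9, 6.1, 8.2, 12.4, 14.0, 19.9]   # hypothetical data
width = 2.5                                                  # as in the figure

# Each value falls in bin number floor(v / width); bins are [0, 2.5), [2.5, 5.0), ...
nbins = int(max(values) // width) + 1
counts = [0] * nbins
for v in values:
    counts[int(v // width)] += 1

# A crude text rendering of the bar heights
for b, c in enumerate(counts):
    print(f"{b * width:4.1f} - {(b + 1) * width:4.1f}: {'#' * c}")
```

In practice this choice is made automatically by the software, but the principle is the same.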
#### Frequency polygons {-}
Figure \@ref(fig:f-hist-gdp) also shows a **frequency polygon** of the GDP
variable. This is obtained by drawing lines to connect the mid-points of
the tops of the bars in a histogram. At each end of the histogram the
lines are further connected to zero, as if the histogram had additional
bars of zero height to the left and right of the smallest and largest
observed categories. The result is a curve with a shape similar to that of the
corresponding histogram, and its interpretation is similar to that of
the histogram.
A histogram is usually preferable to a frequency polygon for presenting
a single distribution, especially since histograms are typically much
easier to produce in standard software such as SPSS. However, frequency
polygons will later be useful for making comparisons between several
distributions.
#### Stem and leaf plots {-}