# Samples and populations {#c-samples}
## Introduction {#s-samples-intro}
So far we have discussed statistical description, which is concerned
with summarizing features of a sample of observed data. From now on,
most of the attention will be on statistical inference. As noted in
Section \@ref(ss-intro-def-descr), the purpose of inference is to draw
conclusions about the characteristics of some larger population based on
what is observed in a sample. In this chapter we will first give more
careful definitions of the concepts of populations and samples, and of
the connections between them. In Section \@ref(s-samples-popdistrs) we
then consider the idea of a population distribution, which is the target
of statistical inference. The discussion of statistical inference will
continue in Chapters \@ref(c-tables)–\@ref(c-means) where we gradually
introduce the basic elements of inference in the contexts of different
types of analyses.
## Finite populations {#s-samples-finpops}
In many cases the population of interest is a particular group of real
people or other units. Consider, for example, the European Social Survey
(ESS) which we used in Chapter \@ref(c-descr1) (see early in Section \@ref(s-descr1-examples)).^[European Social Survey (2012). ESS Round 5 (2010/2011) Technical
Report. London: Centre for Comparative Social Surveys, City
University London. See <http://www.europeansocialsurvey.org> for more on
the ESS.] The ESS is a cross-national survey carried
out biennially in around 30 European countries. It is an
academically-driven social survey which is designed to measure a wide
range of attitudes, beliefs and behaviour patterns among the European
population, especially for the purposes of cross-national comparisons.
The target population of ESS is explicitly stated as being “all persons
aged 15 and over resident within private households, regardless of their
nationality, citizenship, language or legal status” in each of the
participating countries. This is, once “private household” has been
defined carefully, and notwithstanding the inevitable ambiguity in that
the precise number and composition of households are constantly
changing, a well-defined, existing group. It is also a large group: in
the UK, for example, there are around 50 million such people.
Nevertheless, we have no conceptual difficulty with imagining this
collection of individuals. We will call any such population a *finite
population*.
The main problem with studying a large finite population is that it is
usually not feasible to collect data on all of its members. A **census**
is a study where some variables *are* in fact measured for the entire
population. The best-known example is the Census of Population, which at
least aims to be a complete evaluation of all persons living in a
country on a particular date with respect to basic demographic data.
Similarly, we have the Census of Production, Census of Distribution etc.
For most research, however, a census is not feasible. Even when one is
attempted, it is rarely truly comprehensive. For example, all population
censuses which involve collecting the data from the people themselves
end up missing a substantial (and non-random) proportion of the
population. For most purposes a well-executed sample of the kind
described below is actually preferable to an unsuccessful census.
## Samples from finite populations {#s-samples-samples}
When a census is not possible, information on the population is obtained
by observing only a subset of units from it, i.e. a sample. This is
meant to be *representative* of the population, so that we can
*generalise* findings from the sample to the population. To be
representative in a sense appropriate for statistical inference, a
sample from a finite population must be a *probability sample*, obtained
using
- **probability sampling**: a sampling method where every unit in the
population has a **known**, **non-zero** probability of being
selected to the sample.
Probability sampling requires first a **sampling frame**, essentially
one or more lists of units or collections of units which make it
possible to select and contact members of the sample. For example, the
first stage of sampling for many UK surveys uses the Postcode Address
File, a list of postal addresses in the country. A **sampling design**
is then created in such a way that it assigns a **sampling probability**
for each unit, and the sample is drawn so that each unit’s probability
of being selected into the sample is given by their sampling
probability. The selection of the specific set of units actually
included in the sample thus involves *randomness*, usually implemented
with the help of random number generators on computers.
The simplest form of probability sampling is
- **simple random sampling**, where every unit in the population has
the *same* probability of selection.
This requirement of equal selection probabilities is by no means
essential. Other probability sampling methods which relax it include
- **stratified sampling**, where the selection probabilities are set
separately for different groups (*strata*) in the population, for
example separately for men and women, different ethnic groups or
people living in different regions.
- **cluster sampling**, where the units of interest are not sampled
individually but in groups (*clusters*). For example, a school
survey might involve sampling entire classes and then interviewing
every pupil in each selected class.
- **multistage sampling**, which employs a sequence of steps, often
with a combination of stratification, clustering and simple
random sampling. For example, many social surveys use a *multistage
area sampling* design which begins with one or more stages of
sampling areas, then households (addresses) within selected small
areas, and finally individuals within selected households.
These more complex sampling methods are in fact used for most
large-scale social surveys to improve their accuracy and/or
cost-efficiency compared to simple random sampling. For example, the UK
component of the European Social Survey uses a design of three stages:
(1) a stratified sample of postcode sectors, stratified by region, level
of deprivation, percentage of privately rented households, and
percentage of pensioners; (2) a simple random sample of addresses within
the selected sectors; and (3) a simple random sample of one adult from
each selected address.
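As a rough illustration of how selection probabilities differ between designs, the following Python sketch draws a simple random sample and a stratified sample from a small made-up sampling frame. The frame, the two regions, the sample sizes and the function names are all assumptions made for this illustration; they do not describe the ESS design or any real survey.

```python
import random


def simple_random_sample(units, n, rng):
    """Every unit has the same selection probability n / len(units)."""
    return rng.sample(units, n)


def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of a set size within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum[name]))
    return sample


def region(unit):
    """Hypothetical strata: the first 200 units are in region A, the rest in B."""
    return "A" if unit < 200 else "B"


# Hypothetical sampling frame: a list of 1000 unit identifiers.
frame = list(range(1000))

rng = random.Random(451)  # fixed seed so that the sketch is reproducible
srs = simple_random_sample(frame, 100, rng)
strat = stratified_sample(frame, region, {"A": 50, "B": 50}, rng)

# Known, non-zero selection probabilities under each design.
p_srs = 100 / len(frame)                  # 0.1 for every unit
p_strat = {"A": 50 / 200, "B": 50 / 800}  # 0.25 in region A, 0.0625 in region B
```

In the stratified design the units in the smaller region are deliberately over-sampled, which is why their selection probability is higher. Both designs still count as probability sampling, because every unit's probability is known and non-zero.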
Some analyses of such data require the use of *survey weights* to adjust
for the fact that some units were more likely than others to end up in
the sample. The questions of how and when the weights should be used
are, however, beyond the scope of this course. Here we will omit the
weights even in examples where they might normally be used.^[For more on survey weights and the design and analysis of surveys
in general, please see MY456 (Survey Methodology) in the Lent Term.]
Not all sampling methods satisfy the requirements of probability
sampling. Such techniques of **non-probability sampling** include
- *purposive sampling*, where the investigator uses his or her own
“expert” judgement to select units considered to be representative
of the population. It is very difficult to do this well, and very
easy to introduce conscious or unconscious biases into
the selection. In general, it is better to leave the task to the
random processes of probability sampling.
- *haphazard* or *convenience* sampling, as when a researcher simply
uses the first $n$ passers-by who happen to be available and willing
to answer questions. One version of this is *volunteer* sampling,
familiar from call-in “polls” carried out by morning television
shows and newspapers on various topics of current interest. All we
learn from such exercises are the opinions of those readers or
viewers who felt strongly enough about the issue to send in their
response, but these tell us essentially nothing about the average
attitudes of the general population.
- *quota sampling*, where interviewers are required to select a
certain number (quota) of respondents in each of a set of categories
(defined, for example, by sex, age group and income group). The
selection of specific respondents within each group is left to the
interviewer, and is usually done using some (unstated) form of
purposive or convenience sampling. Quota sampling is quite common,
especially in market research, and can sometimes give
reasonable results. However, it is easy to introduce biases in the
selection stage, and almost impossible to know whether the resulting
sample is a representative one.
A famous example of the dangers of non-probability sampling is the
survey by the *Literary Digest* magazine to predict the results of the
1936 U.S. presidential election. The magazine sent out about 10 million
questionnaires on post cards to potential respondents, and based its
conclusions on those that were returned. This introduced biases in at
least two ways. First, the list of those who were sent the questionnaire
was based on registers such as the subscribers to the magazine, and of
people with telephones, cars and various club memberships. In 1936 these
were mainly wealthier people who were more likely to be Republican
voters, and the typically poorer people not on the source lists had no
chance of being included. Second, only about 25% of the questionnaires
were actually returned, effectively turning the sample into a
volunteer sample. The magazine predicted that the Republican candidate
Alf Landon would receive 57% of the vote, when in fact his Democratic
opponent F. D. Roosevelt gained an overwhelming victory with 62% of the
vote. The outcome of the election was predicted correctly by a much
smaller probability sample collected by George Gallup.
A more recent example is the “GM Nation” public consultation exercise on
attitudes to genetically modified (GM) agricultural products, carried
out in the U.K. in 2002–3.^[For more information, see Gaskell, G. (2004). “Science policy and
society: the British debate over GM agriculture”, *Current Opinion
in Biotechnology* **15**, 241–245.] This involved various activities,
including national, regional and local events where interested members
of the public were invited to take part in discussions on GM foods. At
all such events the participants also completed a questionnaire, which
was also available on the GM Nation website. In all, around 37000 people
completed the questionnaire, and around 90% of those expressed
opposition to GM foods. While the authors of the final report of the
consultation drew some attention to the unrepresentative nature of this
sample, this fact had certainly been lost by the time the results were
reported in the national newspapers as “5 to 1 against GM crops in
biggest ever public survey”. At the same time, probability samples
suggested that the British public is actually about evenly split between
supporters and opponents of GM foods.
## Conceptual and infinite populations {#s-samples-infpops}
Even a cursory inspection of academic journals in the social sciences
will reveal that a finite population of the kind discussed above is not
always clearly defined, nor is there often any reference to probability
sampling. Instead, the study designs may for example resemble the
following two examples:
*Example: A psychological experiment*\
Fifty-nine undergraduate students from a large U.S. university took part
in a psychological experiment, either as part of a class project or
for extra credit on a psychology course.^[Experiment 1 in Anderson, C. A., Carnagey, N. L., and Eubanks,
J. (2003). “Exposure to violent media: the effects of songs with
violent lyrics on aggressive thoughts and feelings”. *Journal of
Personality and Social Psychology* **84**, 960–971.] The participants were randomly
assigned to listen to one of two songs, one with clearly violent lyrics
and one with no violent content. One of the variables of interest was a
measure (from a 35-item attitude scale) of state hostility
(i.e. temporary hostile feelings), obtained after the participants had
listened to a song, and the researchers were interested in comparing
levels of hostility between the two groups.
*Example: Voting in a congressional election*\
A political-science article considered the U.S. congressional
election which took place between June 1862 and November 1863,
i.e. during a crucial period in the American Civil War.^[Carson, J. L. et al. (2001). “The impact of national tides and
district-level effects on electoral outcomes: the U.S. congressional
elections of 1862–63”. *American J. of Political Science* **45**,
887–898.] The units of
analysis were the districts in the House of Representatives. One part of
the analysis examined whether the likelihood of the candidate of the
Republican Party (the party of the sitting president Abraham Lincoln)
being elected from a district was associated with such explanatory
variables as whether the Republican was the incumbent, a measure of the
quality of the other main candidate, number of military casualties for
the district, and the timing of the election in the district (especially
in relation to the Union armies’ changing fortunes over the period).
There is no reference here to the kinds of finite populations and
probability samples discussed in Sections \@ref(s-samples-finpops) and
\@ref(s-samples-samples). In the experiment, the participants were a
convenience sample of respondents easily available to the researcher,
while in the election study the units represent (nearly) all the
districts in a single (and historically unique) election. Yet both
articles contain plenty of statistical inference, so the language and
concepts of samples and populations are clearly being used. How is this
to be justified?
In the example of the psychological experiment the subjects will clearly
not be representative of a general (non-student) population in many
respects, e.g. in age and education level. However, it is not really
such characteristics that the study is concerned with, nor is the
population of interest really a population of people. Instead, the
implicit “population” being considered is that of possible values of
level of hostility after a person has listened to one of the songs in
the experiment. In this extended framework, these possible values
include not just the levels of hostility that might be obtained for
different people, but also those that a single person might have after
listening to the song at different times or in different moods etc. The
generalisation from the observed data in the experiment is to this
hypothetical population of possible reactions.
In the political science example the population is also a hypothetical
one, namely those election results that *could* have been obtained if
something had happened differently, i.e. if different people turned up
to vote, if some voters had made different decisions, and so on (or if
we considered a different election in the same conditions, although that
is less realistic in this example, since other elections have not taken
place in the middle of a civil war). In other words, votes that actually
took place are treated as a sample from the population of votes that
could conceivably have taken place.
In both cases the “population” is in some sense a hypothetical or
conceptual one, a population of possible realisations of events, and the
data actually observed are a sample from that population. Sometimes it
is useful to apply similar thinking even to samples from ostensibly
quite finite populations. Any such population, say the residents of a
country, is exactly fixed at one moment only, and was and will be
slightly different at any other time, or would be even now if any one of
a myriad of small events had happened slightly differently in the past.
We could thus view the finite population itself at a single moment as a
sample from a conceptual population of possible realisations. This is
known in survey literature as a *superpopulation*. The data actually
observed are then also a sample from the superpopulation. With this
extension, it is possible to regard almost any set of data as a sample
from some conceptual superpopulation.
The highly hypothetical notion of a conceptual population of possible
events is clearly going to be less easy both to justify and to
understand than the concept of a large but finite population of real
subjects defined in Section \@ref(s-samples-finpops). If you find the
whole idea distracting, you can focus in your mind on the more
understandable latter case, at least if you are willing to believe that
the idea of a conceptual population is also meaningful. Its main
justification is that much of the time it works, in the sense that
useful decision rules and methods of analysis are obtained based on the
idea. Most of the motivation and ideas of statistical inference are
essentially the same for both kinds of populations.
Even when the idea of a conceptual population is invoked, questions of
representativeness of and generalisability to real, finite populations
will still need to be kept in mind in most applications. For example,
the assumption behind the psychological experiment described above is
that the findings about how hearing a violent song affects levels of
hostility are generalisable to some larger population, beyond the 59
participants in the experiment and beyond the body of students in a
particular university. This may well be the case at least to some
extent, but it is still open to questioning. For this reason findings
from studies like this only become really convincing when they are
*replicated* in comparable experiments among different kinds of
participants.
Because the kinds of populations discussed in this section are
hypothetical, there is no sense of them having a particular fixed number
of members. Instead, they are considered to be *infinite* in size. This
also implies (although it may not be obvious why) that we can
essentially always treat samples from such populations as if they were
obtained using simple random sampling.
## Population distributions {#s-samples-popdistrs}
We will introduce the idea of a population distribution first for finite
populations, before extending it to infinite ones. The discussion in
this section focuses on categorical variables, because the concepts are
easiest to explain in that context; generalisations to continuous
variables are discussed in Chapter \@ref(c-means).
Suppose that we have drawn a sample of $n$ units from a finite
population and determined the values of some variables for them. The
units that are not in the sample also possess values of the variables,
even though these are not observed. We can thus easily imagine how any
of the methods which were used in Chapter \@ref(c-descr1) to describe a
sample could also be applied in the same way to the whole population, if
only we knew all the values in it. In particular, we can, paralleling
the sample distribution of a variable, define the **population
distribution** as the set of values of the variable which appear in the
population, together with the frequencies of each value.
For illustration, consider again the example introduced early in Section \@ref(s-descr1-examples). The two variables there are a person’s sex and
his or her attitude toward income redistribution. We have observed them
for a sample of $n=2344$ people drawn from the population of all UK
residents aged 15 or over. The sample distributions are summarised by
Table \@ref(tab:t-attitude).

--------------------------------------------------------------------------------------
\        Agree           \           Neither agree   \          Disagree   \
Sex      strongly        Agree       nor disagree    Disagree   strongly   Total
-------- --------------- ----------- --------------- ---------- ---------- -----------
Male     3.84            10.08       4.56            4.32       1.20       24.00
         (16.00)         (42.00)     (19.00)         (18.00)    (5.00)     (100)
         [7.68]          [20.16]     [9.12]          [8.64]     [2.40]     [48.00]

Female   4.16            13.00       4.68            3.38       0.78       26.00
         (16.00)         (50.00)     (18.00)         (13.00)    (3.00)     (100)
         [8.32]          [26.00]     [9.36]          [6.76]     [1.56]     [52.00]

Total    8.00            23.08       9.24            7.70       1.98       50
         (16.00)         (46.16)     (18.48)         (15.40)    (3.96)     (100)
--------------------------------------------------------------------------------------

: (\#tab:t-sex-attitude-pop)*``The government should take measures to reduce differences in income levels''*: Attitude towards income redistribution by sex, in a hypothetical
population of 50 million people. The numbers in the table are
frequencies in millions of people, row percentages (in parentheses)
and overall percentages in square brackets.

Imagine now that the full population consisted of 50 million people, and
that the values of the two variables for them were as shown in Table
\@ref(tab:t-sex-attitude-pop). The frequencies in this table describe the
population distribution of the variables in this hypothetical
population, with the joint distribution of sex and attitude shown by the
internal cells of the table and the marginal distributions by its
margins. So there are for example 3.84 million men and 4.16 million
women in the population who strongly agree with the attitude statement,
and 1.98 million people overall who strongly disagree with it.
Rather than the frequencies, it is more helpful to discuss population
distributions in terms of proportions. Table \@ref(tab:t-sex-attitude-pop)
shows two sets of them, the overall proportions in square brackets
out of the total population size, and the two rows of conditional
proportions of attitude given sex (in parentheses). Either of these can
be used to introduce the ideas of population distributions, but we focus
on the conditional proportions because they will be more convenient for
the discussion in later chapters. In this population we observe, for
example, that the conditional proportion of “Strongly disagree” given
that a person is a woman is 0.03, i.e. 3% of women strongly disagree
with the statement, while among men the corresponding conditional
proportion is 0.05.
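The percentages in the table can be reproduced directly from the frequencies. The short Python sketch below does this; the variable names are illustrative only, and the frequencies are simply copied from the hypothetical population in Table \@ref(tab:t-sex-attitude-pop).

```python
# Frequencies (in millions of people) copied from the population table.
categories = ["Agree strongly", "Agree", "Neither agree nor disagree",
              "Disagree", "Disagree strongly"]
freq = {
    "Male":   [3.84, 10.08, 4.56, 4.32, 1.20],
    "Female": [4.16, 13.00, 4.68, 3.38, 0.78],
}

# Total population size: 50 million.
population_total = sum(sum(row) for row in freq.values())

# Conditional proportions of attitude given sex: cell frequency / row total.
conditional = {
    sex: [round(f / sum(row), 4) for f in row] for sex, row in freq.items()
}

# Overall proportions: cell frequency / total population size.
overall = {
    sex: [round(f / population_total, 4) for f in row]
    for sex, row in freq.items()
}

for sex in freq:
    print(sex, dict(zip(categories, conditional[sex])))
```

Each list in `conditional` sums to 1 because it describes the distribution of attitude within one sex, whereas the values in `overall` sum to 1 only across both rows together.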
Instead of “proportions”, when we discuss population distributions we
will usually talk of “probabilities”. The two terms are equivalent when
the population is finite and the variables are categorical, as in Table
\@ref(tab:t-sex-attitude-pop), but the language of probabilities is more
appropriate in other cases. We can then say that Table
\@ref(tab:t-sex-attitude-pop) shows two sets of **conditional probabilities**
in the population, which define two conditional **probability
distributions** for attitude given sex.
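In symbols, and using notation that is an assumption of this note rather than something defined elsewhere in the text, write $N_{ij}$ for the population frequency in row $i$ (sex) and column $j$ (attitude) of the table, and $N_{i+}$ for the total of row $i$. The conditional probability of attitude $j$ given sex $i$ is then

$$
P(\text{attitude}=j \mid \text{sex}=i) = \frac{N_{ij}}{N_{i+}},
\qquad \text{for example} \quad
P(\text{Disagree strongly} \mid \text{Female}) = \frac{0.78}{26.00} = 0.03 .
$$

For each sex, these probabilities sum to 1 over the five attitude categories, which is what makes each row a probability distribution.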
The notion of a probability distribution creates a conceptual connection
between population distributions and sampling from them. The connection is that
the probabilities of the population distribution can also be thought of
as sampling probabilities in (simple random) sampling from the
population. For example, here the conditional probability of “Strongly
disagree” among men is 0.05, while the probability of “Strongly agree”
is 0.16. The sampling interpretation of this is that if we sample a man
at random from the population, the probability is 0.05 that he strongly
disagrees and 0.16 that he strongly agrees with the attitude statement.
The view of population distributions as probability distributions also
works in cases other than the kind that is illustrated by Table
\@ref(tab:t-sex-attitude-pop). First, it applies also for continuous
variables, where proportions of individual values are less useful (this
is discussed further in Chapter \@ref(c-means)). Second, it is also
appropriate when the population is regarded as an infinite
superpopulation, in which case the idea of population *frequencies* is
not meaningful. With this device we have thus reached a formulation of a
population distribution which is flexible enough to cover all the
situations where we will need it.
## Need for statistical inference {#s-samples-inference}
We have now introduced the first key concepts that are involved in
statistical inference:
- The population, which may be regarded as finite or infinite.
Distributions of variables in the population are the population
distributions, which are formulated as probability distributions of
the possible values of the variables.
- Random samples from the population, and sample distributions of
variables in the sample.
Substantive research questions are most often questions about population
distributions. This raises the fundamental challenge of inference: what
we are interested in — the population — is not fully observed, while
what we do observe — the sample — is not of main interest for itself.
The sample is, however, what information we do have to draw on for
conclusions about the population. Here a second challenge arises:
because of random variation in the sampling, sample distributions will
not be identical to population distributions, so inference will not be
as simple as concluding that whatever is true of the sample is also true
of the population. Something cleverer is needed to weigh the evidence in
the sample, and that something is statistical inference.
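The effect of random variation in sampling can be seen in a small simulation. The Python sketch below repeatedly draws samples of men from the hypothetical population of Table \@ref(tab:t-sex-attitude-pop) and prints the sample proportions of the five attitude categories. The sample size of 1000, the seed and the function name are arbitrary choices made for illustration. The draws are made independently, as if from an infinite population, which is a very close approximation to simple random sampling here because the population is so large.

```python
import random
from collections import Counter

categories = ["Agree strongly", "Agree", "Neither agree nor disagree",
              "Disagree", "Disagree strongly"]

# Conditional probabilities of attitude among men, from the population table.
p_men = [0.16, 0.42, 0.19, 0.18, 0.05]


def sample_proportions(n, rng):
    """Draw n men independently and return the sample proportions."""
    draws = rng.choices(categories, weights=p_men, k=n)
    counts = Counter(draws)
    return [counts[c] / n for c in categories]


rng = random.Random(2344)  # fixed seed so that the sketch is reproducible
results = [sample_proportions(1000, rng) for _ in range(3)]

print("population:", p_men)
for props in results:
    print("sample:    ", [round(p, 3) for p in props])
```

The three sets of sample proportions should be close to the population probabilities but not equal to them, and they will differ from each other. This is the sampling variation that methods of statistical inference need to take into account.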
The next three chapters are mostly about statistical inference. Each of
them discusses a particular type of analysis and inferential and
descriptive statistical methods for it. These methods are some of the
most commonly used in basic statistical analyses of empirical data. In
addition, we will also use them as contexts in which to introduce the
general concepts of statistical inference. This will be done gradually,
with each chapter both building on previous concepts and introducing new
ones, as follows:
- Chapter \@ref(c-tables): Associations in two-way contingency tables
(significance testing, sampling distributions of statistics).
- Chapter \@ref(c-probs): Single proportions and comparisons of
proportions (probability distributions, parameters, point
estimation, confidence intervals).
- Chapter \@ref(c-means): Means of continuous variables (probability
distributions of continuous variables, and inference for
such variables).