forked from OpenIntroStat/ims
-
Notifications
You must be signed in to change notification settings - Fork 0
/
23-inference-applications.Rmd
693 lines (554 loc) · 34.3 KB
/
23-inference-applications.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
# Applications: Infer {#inference-applications}
```{r, include = FALSE}
source("_common.R")
```
## Recap: Computational methods {#comp-methods-summary}
The computational methods we have presented are used in two settings.
First, in many real life applications (as in those covered here), the mathematical model and computational model give identical conclusions.
When there are no differences in conclusions, the advantage of the computational method is that it gives the analyst a good sense for the logic of the statistical inference process.
Second, when there is a difference in the conclusions (seen primarily in methods beyond the scope of this text), it is often the case that the computational method relies on fewer technical conditions and is therefore more appropriate to use.
### Randomization
The important feature of randomization tests is that the data is permuted in such a way that the null hypothesis is true.
The randomization distribution provides a distribution of the statistic of interest under the null hypothesis, which is exactly the information needed to calculate a p-value --- where the p-value is the probability of obtaining the observed data or more extreme when the null hypothesis is true.
Although there are ways to adjust the randomization for settings other than the null hypothesis being true, they are not covered in this book and they are not used widely.
In approaching research questions with a randomization test, be sure to ask yourself what the null hypothesis represents and how it is that permuting the data is creating different possible null data representations.
**Hypothesis tests.** When using a randomization test, we proceed as follows:
- Write appropriate hypotheses.
- Compute the observed statistic of interest.
- Permute the data repeatedly, each time, recalculating the statistic of interest.
- Compute the proportion of times the permuted statistics are as extreme as or more extreme than the observed statistic, this is the p-value.
- Make a conclusion based on the p-value, and write the conclusion in context and in plain language so anyone can understand the result.
\clearpage
### Bootstrapping
Bootstrapping, in contrast to randomization tests, represents a proxy sampling of the original population.
With bootstrapping, the analyst is not forcing the null hypothesis to be true (or false, for that matter), but instead, they are replicating the variability seen in taking repeated samples from a population.
Because there is no underlying true (or false) null hypothesis, bootstrapping is typically used for creating confidence intervals for the parameter of interest.
Bootstrapping can be used to test particular values of a parameter (e.g., by evaluating whether a particular value of interest is contained in the confidence interval), but generally, bootstrapping is used for interval estimation instead of testing.
**Confidence intervals.** The following is how we generally computed a confidence interval using bootstrapping:
- Repeatedly resample the original data, with replacement, using the same sample size as the original data.
- For each resample, calculate the statistic of interest.
- Calculate the confidence interval using one of the following methods:
- Bootstrap percentile interval: Obtain the endpoints representing the middle (e.g., 95%) of the bootstrapped statistics.
The endpoints will be the confidence interval.
- Bootstrap standard error (SE) interval: Find the SE of the bootstrapped statistics.
The confidence interval will be given by the original observed statistic plus or minus some multiple (e.g., 2) of SEs.
- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.
\vspace{-4mm}
## Recap: Mathematical models {#math-models-summary}
The mathematical models which have been used to produce inferential analyses follow a consistent framework for different parameters of interest.
As a way to contrast and compare the mathematical approach, we offer the following summaries in Tables \@ref(tab:zcompare) and \@ref(tab:tcompare).
\vspace{-4mm}
### z-procedures
Generally, when the response variable is categorical (or binary), the summary statistic is a proportion and the model used to describe the proportion is the standard normal curve (also referred to as a $z$-curve or a $z$-distribution).
We provide Table \@ref(tab:zcompare) partly as a mechanism for understanding $z$-procedures and partly to highlight the extremely common usage of the $z$-distribution in practice.
\vspace{-2mm}
```{r zcompare}
zsim_table <- tribble(
~variable, ~col1, ~col2,
"Response variable", "Binary", "Binary",
"Parameter of interest", "Proportion: $p$", "Difference in proportions: $p_1 - p_2$",
"Statistic of interest", "Proportion: $\\hat{p}$", "Difference in proportions: $\\hat{p}_1 - \\hat{p}_2$",
"Standard error: HT", "$\\sqrt{\\frac{p_0(1-p_0)}{n}}$", "$\\sqrt{\\hat{p}_{pool}\\bigg(1-\\hat{p}_{pool}\\bigg)\\bigg(\\frac{1}{n_1} + \\frac{1}{n_2}}\\bigg)$",
"Standard error: CI", "$\\sqrt{\\frac{\\hat{p}(1-\\hat{p})}{n}}$", "$\\sqrt{\\frac{\\hat{p}_{1}(1-\\hat{p}_{1})}{n_1} + \\frac{\\hat{p}_{2}(1-\\hat{p}_{2})}{n_2}}$",
"Conditions", "1. Independence, 2. Success-failure", "1. Independence, 2. Success-failure"
)
zsim_table %>%
kbl(linesep = "\\addlinespace", booktabs = TRUE, caption = "Similarities of $z$-methods across one and two independent samples analysis of a binary response variable.",
col.names = c("", "One sample ", "Two independent samples"),
escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"), full_width = TRUE) %>%
column_spec(1, width = "10em")
```
\clearpage
**Hypothesis tests.** When applying the $z$-distribution for a hypothesis test, we proceed as follows:
- Write appropriate hypotheses.
- Verify conditions for using the $z$-distribution.
- One-sample: the observations (or differences) must be independent. The success-failure condition of at least 10 success and at least 10 failures should hold.
- For a difference of proportions: each sample must separately satisfy the success-failure conditions, and the data in the groups must also be independent.
- Compute the point estimate of interest and the standard error.
- Compute the Z score and p-value.
- Make a conclusion based on the p-value, and write a conclusion in context and in plain language so anyone can understand the result.
**Confidence intervals.** Similarly, the following is how we generally computed a confidence interval using a $z$-distribution:
- Verify conditions for using the $z$-distribution. (See above.)
- Compute the point estimate of interest, the standard error, and $z^{\star}.$
- Calculate the confidence interval using the general formula:\
point estimate $\pm\ z^{\star} SE.$
- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.
### t-procedures
With quantitative response variables, the $t$-distribution was applied as the appropriate mathematical model in three distinct settings.
Although the three data structures are different, their similarities and differences are worth pointing out.
We provide Table \@ref(tab:tcompare) partly as a mechanism for understanding $t$-procedures and partly to highlight the extremely common usage of the $t$-distribution in practice.
```{r tcompare}
tsim_table <- tribble(
~variable, ~col1, ~col2, ~col3,
"Response variable", "Numeric", "Numeric", "Numeric",
"Parameter of interest", "Mean: $\\mu$", "Paired mean: $\\mu_{diff}$", "Difference in means: $\\mu_1 - \\mu_2$",
"Statistic of interest", "Mean: $\\bar{x}$", "Paired mean: $\\bar{x}_{diff}$", "Difference in means: $\\bar{x}_1 - \\bar{x}_2$",
"Standard error", "$\\frac{s}{\\sqrt{n}}$", "$\\frac{s_{diff}}{\\sqrt{n_{diff}}}$", "$\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}$",
"Degrees of freedom", "$n-1$", "$n_{diff} -1$", "$\\min(n_1 -1, n_2 - 1)$",
"Conditions", "1. Independence, 2. Normality or large samples", "1. Independence, 2. Normality or large samples", "1. Independence, 2. Normality or large samples"
)
tsim_table %>%
kbl(linesep = "\\addlinespace", booktabs = TRUE, caption = "Similarities of $t$-methods across one sample, paired sample, and two independent samples analysis of a numeric response variable.",
col.names = c("", "One sample ", "Paired sample", "Two independent samples"),
escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"), full_width = TRUE) %>%
column_spec(1, width = "10em")
```
\clearpage
**Hypothesis tests.** When applying the $t$-distribution for a hypothesis test, we proceed as follows:
- Write appropriate hypotheses.
- Verify conditions for using the $t$-distribution.
- One-sample or differences from paired data: the observations (or differences) must be independent and nearly normal. For larger sample sizes, we can relax the nearly normal requirement, e.g., slight skew is okay for sample sizes of 15, moderate skew for sample sizes of 30, and strong skew for sample sizes of 60.
- For a difference of means when the data are not paired: each sample mean must separately satisfy the one-sample conditions for the $t$-distribution, and the data in the groups must also be independent.
- Compute the point estimate of interest, the standard error, and the degrees of freedom For $df,$ use $n-1$ for one sample, and for two samples use either statistical software or the smaller of $n_1 - 1$ and $n_2 - 1.$
- Compute the T score and p-value.
- Make a conclusion based on the p-value, and write a conclusion in context and in plain language so anyone can understand the result.
**Confidence intervals.** Similarly, the following is how we generally computed a confidence interval using a $t$-distribution:
- Verify conditions for using the $t$-distribution. (See above.)
- Compute the point estimate of interest, the standard error, the degrees of freedom, and $t^{\star}_{df}.$
- Calculate the confidence interval using the general formula:\
point estimate $\pm\ t_{df}^{\star} SE.$
- Put the conclusions in context and in plain language so even non-statisticians and data scientists can understand the results.
## Case study: Redundant adjectives {#case-study-redundant-adjectives}
Take a look at the images in Figure \@ref(fig:blue-triangle-shapes).
How would you describe the circled item in the top image (A)?
Would you call it "the triangle"?
Or "the blue triangle"?
How about in the bottom image (B)?
Does your answer change?
```{r blue-triangle-shapes, fig.cap = "Two sets of four shapes. In A, the circled triangle is the only triangle. In B, the circled triangle is the only blue triangle.", fig.alt = "Four shapes are presented twice. In A (the top presentation) the shapes and colors are all different: pink circle, yellow square, red diamond, blue triangle. In B (the bottom presentation) the colors are all different but the traingle shape is repeated: pink circle, yellow square, red triangle, blue triangle. In each of the two presentations the blue triangle is circled."}
shape_names <- c(
"circle filled",
"square filled",
"diamond filled",
"triangle filled",
"circle filled",
"square filled",
"triangle filled",
"triangle filled"
)
shapes <- data.frame(
shape_names = shape_names,
figure = c(rep(1, 4), rep(2, 4)),
x = rep(1:4, 2),
y = 1,
color = rep(c(IMSCOL["pink", "full"], IMSCOL["yellow", "full"],
IMSCOL["red", "full"], IMSCOL["blue", "full"]), 2)
)
p1 <- ggplot(shapes %>% filter(figure == 1), aes(x, y)) +
geom_point(aes(shape = shape_names, color = color, fill = color), size = 20) +
scale_shape_identity() +
scale_color_identity() +
scale_fill_identity() +
theme_void() +
expand_limits(x = c(0.5, 4.5)) +
annotate("point", x = 4, y = 1, shape = "circle open", color = "black", size = 40)
p2 <- ggplot(shapes %>% filter(figure == 2), aes(x, y)) +
geom_point(aes(shape = shape_names, color = color, fill = color), size = 20) +
scale_shape_identity() +
scale_color_identity() +
scale_fill_identity() +
theme_void() +
expand_limits(x = c(0.5, 4.5)) +
annotate("point", x = 4, y = 1, shape = "circle open", color = "black", size = 40)
p1 / p2 +
plot_annotation(tag_levels = "A")
```
In the top image in Figure \@ref(fig:blue-triangle-shapes) the circled item is the only triangle, while in the bottom image the circled item is one of two triangles.
While in the top image "the triangle" is a sufficient description for the circled item, many of us might choose to refer to it as the "blue triangle" anyway.
In the bottom image there are two triangles, so "the triangle" is no longer sufficient, and to describe the circled item we must qualify it with the color as well, as "the blue triangle".
Your answers to the above questions might be different if you're answering in a different language than English.
For example, in Spanish, the adjective comes after the noun (e.g., "el triángulo azul") therefore the incremental value of the additional adjective might be different for the top image.
Researchers studying frequent use of redundant adjectives (e.g., referring to a single triangle as "the blue triangle") and incrementality of language processing designed an experiment where they showed the following two images to 22 native English speakers (undergraduates from University College London) and 22 native Spanish speakers (undergraduates from the Universidad de las Islas Baleares).
They found that in both languages, the subjects used more redundant color adjectives in denser displays where it would be more efficient.
[@rubio-fernandez2021]
```{r fig.cap = "Images used in one of the experiments described in [@rubio-fernandez2021].", fig.alt = "Two presentations of shapes. In each presentation all of the shapes and their colors are unique. In the left presentation, the blue triangle is circled and is one of four shapes. In the right presentation, the blue triangle is circled and is one of sixteen shapes."}
knitr::include_graphics("images/redundant-adjectives-blue-triangle.png")
```
In this case study we will examine data from redundant adjective study, which the authors have made available on Open Science Framework at [osf.io/9hw68](https://osf.io/9hw68/).
```{r}
# analysis based on
# https://osf.io/fqnms/ > Production data > AnalyzeProduction.R
# one row per question
redundant_individual <- read_csv("data/ENGLISH-SPANISH PRODUCTION.csv") %>%
clean_names() %>%
mutate(
items = if_else(question > 10, 16, 4),
items = as.factor(items),
adjective = if_else(color_response == 1, "redundant", "not redundant")
) %>%
select(-color_response)
# one row per individual
redundant <- redundant_individual %>%
group_by(language, subject, items) %>%
summarise(
n_questions = n(),
redundant_perc = sum(adjective == "redundant") * 100 / n_questions,
.groups = "drop"
)
```
Table \@ref(tab:redundant-data) shows the top six rows of the data.
The full dataset has `r nrow(redundant)` rows.
Remember that there are a total of 44 subjects in the study (22 English and 22 Spanish speakers).
There are two rows in the dataset for each of the subjects: one representing data from when they were shown an image with 4 items on it and the other with 16 items on it.
Each subject was asked 10 questions for each type of image (with a different layout of items on the image for each question).
The variable of interest to us is `redundant_perc`, which gives the percentage of questions the subject used a redundant adjective to identify "the blue triangle".
```{r redundant-data}
redundant %>%
slice_head(n = 6) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Top six rows of the data collected in the study.",
align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = FALSE) %>%
column_spec(1:3, width = "5em") %>%
column_spec(4:5, width = "8em")
```
### Exploratory analysis
In one of the images shown to the subjects, there are 4 items, and in the other, there are 16 items.
In each of the images the circled item is the only triangle, therefore referring to it as "the blue triangle" or as "el triángulo azul" is considered redundant.
If the subject's response was "the triangle", they were recorded to have not used a redundant adjective.
If the response was "the blue triangle", they were recorded to have used a redundant adjective.
Figure \@ref(fig:reduntant-bar) shows the results of the experiment.
We can see that English speakers are more likely than Spanish speakers to use redundant adjectives, and that in both languages, subjects are more likely to use a redundant adjective when there are more items in the image (i.e. in a denser display).
```{r reduntant-bar, fig.cap = "Results of redundant adjective usage experiment from [@rubio-fernandez2021]. English speakers are more likely than Spanish speakers to use redundant adjectives, regardless of number of items in image. For both images, respondents are more likely to use a redundant adjective when there are more items in the image."}
redundant_summary <- redundant %>%
group_by(language, items) %>%
summarise(mean_redundant_perc = mean(redundant_perc), .groups = "drop")
redundant_summary %>%
ggplot(aes(x = items, y = mean_redundant_perc , fill = language))+
geom_col(position = "dodge")+
scale_y_continuous(labels = label_percent(scale = 1)) +
labs(
x = "Number of items in image",
y = "Percentage of\nredundant adjective usage"
) +
scale_fill_manual(values = c(IMSCOL["blue", "full"], IMSCOL["red", "full"]))
```
These values are also shown in Table \@ref(tab:redundant-table).
```{r redundant-table}
redundant_summary %>%
kbl(linesep = "", booktabs = TRUE,
col.names = c("Language", "Number of items", "Percentage redundant"),
caption = caption_helper("Summary of redundant adjective usage experiment from study."),
align = "lrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = FALSE) %>%
column_spec(1, width = "10em")
```
### Confidence interval for a single mean
```{r}
xbar_eng_4 <- redundant_summary %>%
filter(language == "English", items == 4) %>%
pull(mean_redundant_perc) %>%
round(2)
```
In this experiment, the average percentage of redundant adjective usage among subjects who responded in English when presented with an image with 4 items in it is `r xbar_eng_4`.
Along with the sample average as a point estimate, however, we can construct a confidence interval for the true mean redundant adjective usage of English speakers who use redundant color adjectives when describing items in an image that is not very dense.
```{r boot-eng-4-construct}
set.seed(74)
boot_eng_4 <- redundant %>%
filter(language == "English", items == 4) %>%
specify(response = redundant_perc) %>%
generate(1000, type = "bootstrap") %>%
calculate(stat = "mean")
ci_eng_4 <- boot_eng_4 %>%
get_confidence_interval(level = 0.95)
ci_eng_4_lower <- as.numeric(ci_eng_4)[1]
ci_eng_4_upper <- as.numeric(ci_eng_4)[2]
```
Using a computational method, we can construct the interval via bootstrapping.
Figure \@ref(fig:boot-eng-4-viz) shows the distribution of 1,000 bootstrapped means from this sample.
The 95% confidence interval (that is calculated by taking the 2.5th and 97.5th percentile of the bootstrap distribution is `r round(ci_eng_4_lower,1)`% to `r round(ci_eng_4_upper,1)`%.
Note that this interval for the true population parameter is only valid if we can assume that the sample of English speakers are representative of the population of all English speakers.
```{r boot-eng-4-viz, fig.cap = "(ref:boot-eng-4-viz-cap)"}
boot_eng_4 %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 5, fill = IMSCOL["green", "full"]) +
annotate("line",
x = c(ci_eng_4_lower, ci_eng_4_lower),
y = c(0, 250),
color = IMSCOL["green", "f2"], size = 1) +
annotate("line",
x = c(ci_eng_4_upper, ci_eng_4_upper),
y = c(0, 250),
color = IMSCOL["green", "f2"], size = 1) +
annotate("rect",
xmin = ci_eng_4_lower, xmax = ci_eng_4_upper,
ymin = 0, ymax = 250,
alpha = 0.3, fill = IMSCOL["green", "full"]) +
labs(
x = "Mean redundant adjective usage percentage",
y = "Count",
title = "1,000 bootstrap means"
) +
scale_x_continuous(labels = label_percent(scale = 1))
```
(ref:boot-eng-4-viz-cap) Distribution of 1,000 bootstrapped means of redundant adjective usage percentage among English speakers who were shown four items in images. Overlaid on the distribution is the 95% bootstrap percentile interval that ranges from 19.1% to 56.4%.
Using a similar technique, we can also construct confidence intervals for the true mean redundant adjective usage percentage for English speakers who are shown dense (16 item) displays and for Spanish speakers with both types (4 and 16 items) displays.
However, these confidence intervals are not very meaningful to compare to one another as the interpretation of the "true mean redundant adjective usage percentage" is quite an abstract concept.
Instead, we might be more interested in comparative questions such as "Does redundant adjective usage differ between dense and sparse displays among English speakers and among Spanish speakers?" or "Does redundant adjective usage differ between English speakers and Spanish speakers?" To answer either of these questions we need to conduct a hypothesis test.
### Paired mean test
```{r}
redundant_paired <- redundant %>%
filter(language == "English") %>%
select(-language, -n_questions) %>%
pivot_wider(id_cols = subject, names_from = items, names_prefix = "redundant_perc_",
values_from = redundant_perc) %>%
mutate(diff_redundant_perc = redundant_perc_16 - redundant_perc_4)
redundant_paired_mean <- redundant_paired %>%
summarize(mean(diff_redundant_perc)) %>%
pull() %>%
round(2)
```
Let's start with the following question: "Do the data provide convincing evidence of a difference in mean redundant adjective usage percentages between sparse (4 item) and dense (16 item) displays for English speakers?" Note that the English speaking participants were each evaluated on both the 4 item and the 16 item displays.
Therefore, the variable of interest is the difference in redundant percentage.
The statistic of interest will be the average of the differences, here $\bar{x}_{diff} =$ `r redundant_paired_mean`.
Data from the first six English speaking participants are seen in Table \@ref(tab:redundant-data-paired).
Although the redundancy percentages seem higher in the 16 item task, a hypothesis test will tell us whether the differences observed in the data could be due to natural variability.
```{r redundant-data-paired}
redundant_paired %>%
slice_head(n = 6) %>%
kbl(linesep = "", booktabs = TRUE,
caption = "Six participants who speak English with redundancy difference.",
align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = c("striped", "hold_position"),
full_width = FALSE) %>%
column_spec(1, width = "5em") %>%
column_spec(2:4, width = "12em")
```
\clearpage
We can answer the research question using a hypothesis test with the following hypotheses:
$$H_0: \mu_{diff} = 0$$
$$H_A: \mu_{diff} \ne 0$$
where $\mu_{diff}$ is the true difference in redundancy percentages when comparing a 16 item display with a 4 item display.
Recall that the computational method used to assess a hypothesis pertaining to the true average of a paired difference shuffles the observed percentage across the two groups (4 item vs 16 item) but **within** a single participant.
The shuffling process allows for repeated calculations of potential sample differences under the condition that the null hypothesis is true.
Figure \@ref(fig:eng-viz) shows the distribution of 1,000 mean differences from redundancy percentages permuted across the two conditions.
Note that the distribution is centered at 0, since the structure of randomly assigning redundancy percentages to each item display will balance the data out such that the average of any differences will be zero.
```{r eng-viz, fig.cap = "Distribution of 1,000 mean differences of redundant adjective usage percentage among English speakers who were shown images with 4 and 16 items. Overlaid on the distribution is the observed average difference in the sample (solid line) as well as the difference in the other direction (dashed line), which is far out in the tail, yielding a p-value that is approximately 0."}
set.seed(74)
null_eng <- redundant_paired %>%
specify(response = diff_redundant_perc) %>%
hypothesize(null = "point", mu = 0) %>%
generate(1000, type = "bootstrap") %>%
calculate(stat = "mean")
obs_stat_eng <- redundant_paired %>%
specify(response = diff_redundant_perc) %>%
calculate(stat = "mean") %>%
as.numeric()
null_eng %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 5, fill = IMSCOL["green", "full"]) +
annotate(
"line",
x = c(obs_stat_eng, obs_stat_eng),
y = c(0, 200),
color = IMSCOL["red", "full"], size = 1
) +
annotate(
"line",
x = c(-obs_stat_eng, -obs_stat_eng),
y = c(0, 200),
color = IMSCOL["red", "full"], size = 1,
linetype = "dashed"
) +
labs(
x = "Mean difference in redundant adjective usage percentages\n(16 items - 4 items)",
y = "Count",
title = "1,000 randomized mean difference"
) +
scale_x_continuous(labels = label_percent(scale = 1))
```
With such a small p-value, we reject the null hypothesis and conclude that the data provide convincing evidence of a difference in mean redundant adjective usage percentages across different displays for English speakers.
### Two independent means test
Finally, let's consider the question "How does redundant adjective usage differ between English speakers and Spanish speakers?" The English speakers are independent from the Spanish speakers, but since the same subjects were shown the two types of displays, we can't combine data from the two display types (4 objects and 16 objects) together while maintaining independence of observations.
Therefore, to answer questions about language differences, we will need to conduct two hypothesis tests, one for sparse displays and the other for dense displays.
In each of the tests, the hypotheses are as follows:
$$H_0: \mu_{English} = \mu_{Spanish}$$
$$H_A: \mu_{English} \ne \mu_{Spanish}$$
Here, the randomization process is slightly different than the paired setting (because the English and Spanish speakers do not have a natural pairing across the two groups).
To answer the research question using a computational method, we can use a randomization test where we permute the data across all participants under the assumption that the null hypothesis is true (no difference in mean redundant adjective usage percentages across English vs Spanish speakers).
```{r}
# 4 item
set.seed(74)
null_4 <- redundant %>%
filter(items == 4) %>%
specify(response = redundant_perc, explanatory = language) %>%
hypothesize(null = "independence") %>%
generate(1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("English", "Spanish"))
obs_stat_4 <- redundant %>%
filter(items == 4) %>%
specify(response = redundant_perc, explanatory = language) %>%
calculate(stat = "diff in means", order = c("English", "Spanish")) %>%
as.numeric()
pval_4 <- null_4 %>%
get_p_value(obs_stat = obs_stat_4, direction = "both") %>%
as.numeric()
# 16 item
set.seed(74)
null_16 <- redundant %>%
filter(items == 16) %>%
specify(response = redundant_perc, explanatory = language) %>%
hypothesize(null = "independence") %>%
generate(1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("English", "Spanish"))
obs_stat_16 <- redundant %>%
filter(items == 16) %>%
specify(response = redundant_perc, explanatory = language) %>%
calculate(stat = "diff in means", order = c("English", "Spanish")) %>%
as.numeric()
pval_16 <- null_16 %>%
get_p_value(obs_stat = obs_stat_16, direction = "both") %>%
as.numeric()
```
Figure \@ref(fig:compare-lang-viz) shows the null distributions for each of these hypothesis tests.
The p-value for the 4 item display comparison is very small (`r pval_4`) while the p-value for the 16 item display is much larger (`r pval_16`).
```{r compare-lang-viz, fig.cap = "Distributions of 1,000 differences in randomized means of redundant adjective usage percentage between English and Spanish speakers. Plot A shows the differences in 4 item displays and Plot B shows the differences in 16 item displays. In each plot, the observed differences in the sample (solid line) as well as the differences in the other direction (dashed line) are overlaid.", fig.asp = 1}
p_4 <- null_4 %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 5, fill = IMSCOL["green", "full"]) +
annotate(
"line",
x = c(obs_stat_4, obs_stat_4),
y = c(0, 300),
color = IMSCOL["red", "full"], size = 1
) +
annotate(
"line",
x = -1 *c(obs_stat_4, obs_stat_4),
y = c(0, 300),
color = IMSCOL["red", "full"], size = 1,
linetype = "dashed"
) +
labs(
x = "Diffence in mean redundant adjective usage percentages\n(English - Spanish)",
y = "Count",
title = "4 item display"
) +
scale_x_continuous(labels = label_percent(scale = 1))
p_16 <- null_16 %>%
ggplot(aes(x = stat)) +
geom_histogram(binwidth = 5, fill = IMSCOL["green", "full"]) +
annotate(
"line",
x = c(obs_stat_16, obs_stat_16),
y = c(0, 200),
color = IMSCOL["red", "full"], size = 1
) +
annotate(
"line",
x = -1 *c(obs_stat_16, obs_stat_16),
y = c(0, 200),
color = IMSCOL["red", "full"], size = 1,
linetype = "dashed"
) +
labs(
x = "Diffence in mean redundant adjective usage percentages\n(English - Spanish)",
y = "Count",
title = "16 item display"
) +
scale_x_continuous(labels = label_percent(scale = 1))
p_4 / p_16 +
plot_annotation(
title = "1,000 differences in randomized means",
tag_levels = "A"
)
```
Based on the p-values (a measure of deviation from the null claim), we can conclude that the data provide convincing evidence of a difference in mean redundant adjective usage percentages between languages in 4 item displays (small p-value) but not in 16 item displays (not small p-value).
The results suggests that language patterns around redundant adjective usage might be more similar for denser displays than sparser displays.
\clearpage
## Interactive R tutorials {#inference-tutorials}
Navigate the concepts you've learned in this chapter in R using the following self-paced tutorials.
All you need is your browser to get started!
::: {.alltutorials data-latex=""}
[Tutorial 5: Statistical inference](https://openintrostat.github.io/ims-tutorials/05-infer/)\
```{asis, echo = knitr::is_latex_output()}
https://openintrostat.github.io/ims-tutorials/05-infer
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 1: Inference for a single proportion](https://openintro.shinyapps.io/ims-05-infer-01/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-01
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 2: Hypothesis tests to compare proportions](https://openintro.shinyapps.io/ims-05-infer-02/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-02
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 3: Chi-squared test of independence](https://openintro.shinyapps.io/ims-05-infer-03/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-03
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 4: Chi-squared goodness of fit Test](https://openintro.shinyapps.io/ims-05-infer-04/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-04
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 5: Bootstrapping for estimating a parameter](https://openintro.shinyapps.io/ims-05-infer-05/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-05
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 6: Introducing the t-distribution](https://openintro.shinyapps.io/ims-05-infer-06/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-06
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 7: Inference for difference in two means](https://openintro.shinyapps.io/ims-05-infer-07/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-07
```
:::
::: {.singletutorial data-latex=""}
[Tutorial 5 - Lesson 8: Comparing many means](https://openintro.shinyapps.io/ims-05-infer-08/)\
```{asis, echo = knitr::is_latex_output()}
https://openintro.shinyapps.io/ims-05-infer-08
```
:::
```{asis, echo = knitr::is_latex_output()}
You can also access the full list of tutorials supporting this book at\
[https://openintrostat.github.io/ims-tutorials](https://openintrostat.github.io/ims-tutorials).
```
```{asis, echo = knitr::is_html_output()}
You can also access the full list of tutorials supporting this book [here](https://openintrostat.github.io/ims-tutorials).
```
\vspace{-3mm}
## R labs {#inference-labs}
Further apply the concepts you've learned in this part in R with computational labs that walk you through a data analysis case study.
::: {.singlelab data-latex=""}
[Inference for categorical responses - Texting while driving](https://www.openintro.org/go?id=ims-r-lab-infer-1)\
```{asis, echo = knitr::is_latex_output()}
https://www.openintro.org/go?id=ims-r-lab-infer-1
```
:::
\vspace{-2mm}
::: {.singlelab data-latex=""}
[Inference for numerical responses - Youth Risk Behavior Surveillance System](https://www.openintro.org/go?id=ims-r-lab-infer-2)\
```{asis, echo = knitr::is_latex_output()}
https://www.openintro.org/go?id=ims-r-lab-infer-2
```
:::
```{asis, echo = knitr::is_latex_output()}
You can also access the full list of labs supporting this book at\
[https://www.openintro.org/go?id=ims-r-labs](https://www.openintro.org/go?id=ims-r-labs).
```
```{asis, echo = knitr::is_html_output()}
You can also access the full list of labs supporting this book [here](https://www.openintro.org/go?id=ims-r-labs).
```