---
output:
  bookdown::gitbook:
    lib_dir: "book_assets"
    includes:
      in_header: google_analytics.html
  pdf_document: default
  html_document: default
---
# Modeling continuous relationships in R
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(fivethirtyeight)
library(BayesMed)
library(cowplot)
library(knitr)

set.seed(123456) # set random seed to exactly replicate results

# load the NHANES data library
library(NHANES)

# drop duplicated IDs within the NHANES dataset
NHANES <-
  NHANES %>%
  dplyr::distinct(ID, .keep_all = TRUE)

NHANES_adult <-
  NHANES %>%
  drop_na(Weight) %>%
  subset(Age >= 18)
```
## Computing covariance and correlation (Section \@ref(covariance-and-correlation))
Let's first look at our toy example of covariance and correlation. For this example, we start by generating a set of X values.
```{r}
df <-
  tibble(x = c(3, 5, 8, 10, 12))
```
Then we create a related Y variable by adding some random noise to the X variable:
```{r}
df <- df %>%
  mutate(y = x + round(rnorm(n = 5, mean = 0, sd = 2)))
```
We compute the deviations and multiply them together to get the crossproduct:
```{r}
df <- df %>%
  mutate(
    y_dev = y - mean(y),
    x_dev = x - mean(x),
    crossproduct = y_dev * x_dev
  )
```
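To inspect the intermediate values, we can display the data frame as a table using `kable()` from the knitr package loaded above:

```{r}
kable(df)
```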
And then we compute the covariance and correlation:
```{r}
results_df <- tibble(
  covXY = sum(df$crossproduct) / (nrow(df) - 1),
  corXY = sum(df$crossproduct) /
    ((nrow(df) - 1) * sd(df$x) * sd(df$y))
)
kable(results_df)
```
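As a quick check on the hand computation, R's built-in `cov()` and `cor()` functions should return the same values:

```{r}
# compare against the manual results above
cov(df$x, df$y)
cor(df$x, df$y)
```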
## Hate crime example
Now we will look at the hate crime data from the `fivethirtyeight` package. First we need to prepare the data by removing NA values and creating abbreviations for the states. To do the latter, we use the `state.abb` and `state.name` variables that come with R, along with the `match()` function, which matches the state names in the `hate_crimes` data frame to those in `state.name`.
```{r}
hateCrimes <-
  hate_crimes %>%
  mutate(state_abb = state.abb[match(state, state.name)]) %>%
  drop_na(avg_hatecrimes_per_100k_fbi, gini_index)

# manually fix the DC abbreviation
hateCrimes$state_abb[hateCrimes$state == "District of Columbia"] <- "DC"
```
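Before testing, it can help to visualize the relationship. Here is a minimal sketch (the axis labels are our own wording) that plots each state's Gini index against its hate crime rate, using the state abbreviations as point labels:

```{r}
ggplot(hateCrimes,
       aes(x = gini_index,
           y = avg_hatecrimes_per_100k_fbi,
           label = state_abb)) +
  geom_text() +
  labs(x = "Gini index",
       y = "Avg. hate crimes per 100k (FBI)")
```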
```{r echo=FALSE}
# perform correlation test on hate crime data
corr_results <- cor.test(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index,
  alternative = "greater"
)
corr_results
```
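`cor.test()` returns an object of class `htest`, which is a list; its elements can be extracted individually if we want to reuse them later:

```{r}
# pull out the correlation estimate and p-value
corr_results$estimate
corr_results$p.value
```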
Remember that we can also compute the p-value using randomization. To do this, we shuffle the order of one of the variables, breaking the link between the X and Y variables --- effectively making the null hypothesis (that the correlation is less than or equal to zero) true. Here we first create a function that takes in two variables, shuffles the order of one of them (without replacement), and then returns the correlation between that shuffled variable and the original copy of the second variable.
```{r}
# create a function to compute the
# correlation on the shuffled values
shuffleCorr <- function(x, y) {
  # randomly reorder x
  xShuffled <- sample(x)
  # compute and return correlation
  # with shuffled variable
  return(tibble(cor = cor(xShuffled, y)))
}

# run this function 5000 times
input_df <- tibble(id = seq(5000))
shuffleDist <- input_df %>%
  group_by(id) %>%
  do(shuffleCorr(
    hateCrimes$avg_hatecrimes_per_100k_fbi,
    hateCrimes$gini_index
  ))
```
Now we take the distribution of correlations obtained after shuffling and compare it to our observed correlation, in order to obtain the empirical probability of our observed data under the null hypothesis.
```{r}
mean(shuffleDist$cor > corr_results$estimate)
```
This value is fairly close to (though a bit larger than) the one obtained using `cor.test()`.
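One way to see where this empirical p-value comes from is to plot the null distribution of shuffled correlations alongside the observed correlation; here is a minimal ggplot2 sketch:

```{r}
# histogram of the null distribution, with the observed
# correlation marked by a vertical line
ggplot(shuffleDist, aes(x = cor)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = corr_results$estimate, color = "blue")
```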
## Robust correlations (Section \@ref(robust-correlations))
In the previous chapter we also saw that the hate crime data contained one substantial outlier, which appeared to drive the significant correlation. To compute the Spearman correlation, we first need to convert the data into their ranks, which we can do using the `rank()` function:
```{r}
# convert each variable to its ranks
hateCrimes <- hateCrimes %>%
  mutate(hatecrimes_rank = rank(avg_hatecrimes_per_100k_fbi),
         gini_rank = rank(gini_index))
```
We can then compute the Spearman correlation by applying the Pearson correlation to the rank variables:
```{r}
# store the observed rank correlation for the randomization test below
corr_spearman <- cor(hateCrimes$hatecrimes_rank,
                     hateCrimes$gini_rank)
corr_spearman
```
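We can confirm this value using R's built-in Spearman option, which performs the ranking for us:

```{r}
cor(hateCrimes$avg_hatecrimes_per_100k_fbi,
    hateCrimes$gini_index,
    method = "spearman")
```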
We see that this is much smaller than the value obtained using the Pearson correlation on the original data. We can assess its statistical significance using randomization:
```{r}
# create a function to compute the rank
# correlation on the shuffled values
shuffleCorrRank <- function(x, y) {
  # randomly reorder x
  xShuffled <- sample(x)
  # compute and return rank correlation
  # with shuffled variable
  xrank <- rank(xShuffled)
  yrank <- rank(y)
  return(tibble(cor = cor(xrank, yrank)))
}

# run this function 5000 times
input_df <- tibble(id = seq(5000))
shuffleDist <- input_df %>%
  group_by(id) %>%
  do(shuffleCorrRank(
    hateCrimes$avg_hatecrimes_per_100k_fbi,
    hateCrimes$gini_index
  ))

# compare the null distribution to the observed rank correlation
mean(shuffleDist$cor > corr_spearman)
```
Here we see that the p-value is substantially larger and far from significance.
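As a cross-check, `cor.test()` can also compute the Spearman correlation and an approximate p-value directly (we pass `exact = FALSE` to request the asymptotic approximation, which avoids warnings when the data contain tied ranks):

```{r}
cor.test(hateCrimes$avg_hatecrimes_per_100k_fbi,
         hateCrimes$gini_index,
         method = "spearman",
         exact = FALSE,
         alternative = "greater")
```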