This repository has been archived by the owner on Oct 12, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 2
/
tidyverse-list-columns.Rmd
236 lines (181 loc) · 7.37 KB
/
tidyverse-list-columns.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
---
title: "Analysis results in a consistent data structure"
author: "Florian Privé"
date: "October 21, 2017"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = "center", out.width = "75%", fig.asp = 0.75, cache = TRUE)
options(width = 85)
```
How do you store results of an analysis?
When you have different parameters that are varying, when you compare many methods and when you want to keep all the results of an analysis, your code can become quite complex.
In this tutorial, we make a comparison of machine learning methods for predicting disease based on small SNP data. We show how to use the tidyverse set of packages to make the analysis easier by using consistent data structures and functional programming. We use tibbles with [list-columns](http://r4ds.had.co.nz/many-models.html).
## A simulation made easy
### Data
Let's use some data from [Gad Abraham's GitHub repository](https://github.com/gabraham/flashpca/tree/master/HapMap3) which have been filtered and quality-controled.
```{r}
library(tidyverse)
tmpfile <- tempfile()
base <- paste0(
"https://github.com/gabraham/flashpca/raw/master/HapMap3/",
"1kg.ref.phase1_release_v3.20101123_thinned_autosomal_overlap")
exts <- c(".bed", ".bim", ".fam")
walk2(paste0(base, exts), paste0(tmpfile, exts), ~download.file(.x, .y))
```
```{r}
library(bigsnpr)
rdsfile <- snp_readBed(paste0(tmpfile, ".bed"))
bigsnp <- snp_attach(rdsfile)
G <- bigsnp$genotypes
G[which(is.na(G[]))] <- 0 # quick and dirty imputation
obj.svd <- snp_autoSVD(G, bigsnp$map$chromosome, bigsnp$map$physical.pos, ncores = 2)
# Matrix of genotypes and first 10 PCs
X <- cbind(G[], obj.svd$u)
```
### Methods' functions
Each method's function returns a tibble (data frame) with 4 columns:
1. the name of the method,
2. the predictive scores and true phenotypes for the test set, as a [list-column](http://r4ds.had.co.nz/many-models.html),
3. the timing of the main computations (in seconds),
4. the number of SNPs used for the prediction.
```{r}
logit <- function(X, y, ind.train, ind.test) {
timing <- system.time({
train <- glmnet::cv.glmnet(X[ind.train, ], y[ind.train], type.measure = "auc",
family = "binomial", alpha = 0.5)
preds <- predict(train, X[ind.test, ])
})[3]
tibble(
method = "logit",
eval = list(cbind(preds, y[ind.test])),
timing = timing,
nb.preds = sum(glmnet::coef.cv.glmnet(train) != 0)
)
}
```
```{r}
svm <- function(X, y, ind.train, ind.test) {
timing <- system.time({
train <- LiblineaR::LiblineaR(X[ind.train, ], y[ind.train], type = 5)
preds <- predict(train, X[ind.test, ], decisionValues = TRUE)
})[3]
tibble(
method = "svm",
eval = list(cbind(1 - preds$decisionValues[, 1], y[ind.test])),
timing = timing,
nb.preds = sum(train$W != 0)
)
}
```
```{r}
xgb <- function(X, y, ind.train, ind.test) {
timing <- system.time({
train <- xgboost::xgboost(data = X[ind.train, ], label = y[ind.train],
nrounds = 10, max_depth = 3, eta = 1,
objective = "binary:logistic",
verbose = FALSE, save_period = NULL)
preds <- predict(train, X[ind.test, ])
})[3]
tibble(
method = "xgb",
eval = list(cbind(preds, y[ind.test])),
timing = timing,
nb.preds = nrow(xgboost::xgb.importance(model = train))
)
}
```
### Simulation of phenotypes
```{r}
get_pheno <- function(
G, ## matrix of genotypes
h2, ## heritability
M, ## nbs of causal variants
effects.dist = c("gaussian", "laplace"), ## distribution of effects
K = 0.3 ## prevalence
) {
effects.dist <- match.arg(effects.dist)
set <- sample(ncol(G), size = M)
effects <- `if`(effects.dist == "gaussian",
rnorm(M, sd = sqrt(h2 / M)),
rmutil::rlaplace(M, s = sqrt(h2 / (2*M))))
y.simu <- scale(G[, set]) %*% effects
y.simu <- y.simu / sd(y.simu) * sqrt(h2)
stopifnot(all.equal(drop(var(y.simu)), h2))
y.simu <- y.simu + rnorm(nrow(G), sd = sqrt(1 - h2))
# Liability Threshold Model (LTM)
pheno <- as.numeric(y.simu > qnorm(1 - K))
}
```
### Simulations
First, we use `expand.grid()` to create a data frame with all possible combination of parameters we compare.
```{r}
results <- expand.grid(
M = c(30, 3000),
dist = c("gaussian", "laplace"),
num.simu = 1:10,
stringsAsFactors = FALSE
) %>%
as_tibble() %>%
print()
```
Then, we use package {purrr} to add a new column to the data frame with all the results. The main condition for a data frame is that each column has the same number of elements. So, by using lists, we can basically store any types of data in a data frame. For example, in column `"eval"` we store matrices, this is called a list-column. In the newly-created column `"res"`, we store data frames! We use tibbles instead of standard data frames because they print better.
```{r, cache=TRUE}
results[["res"]] <- pmap(results, function(M, dist, num.simu) {
cat(".")
y <- get_pheno(X, 0.8, M, dist)
n <- nrow(X)
ind.train <- sort(sample(n, size = 500))
ind.test <- setdiff(1:n, ind.train)
bind_rows(
logit(X, y, ind.train, ind.test),
svm(X, y, ind.train, ind.test),
xgb(X, y, ind.train, ind.test)
)
})
```
```{r}
results
```
### Results
First, we use function `unnest()` of package {tidyr} so that we have a tidier data frame. To learn more about *tidy data*, please refer to [this vignette](http://tidyr.tidyverse.org/articles/tidy-data.html) and [this book chapter](http://r4ds.had.co.nz/tidy-data.html).
```{r}
(results <- unnest(results, res))
```
To compute the AUC (a measure of prediction accuracy), we use {purrr} again and specify the type of the output we want (`_dbl` for doubles). Here, we could use function `mutate()` of package {dplyr} instead of standard accessor `[[`.
```{r}
results[["AUC"]] <- map_dbl(results[["eval"]], ~ bigstatsr::AUC(.x[, 1], .x[, 2]))
results
```
Now, let's get some summaries of the results with package dplyr.
```{r}
results %>%
group_by(M, dist, method) %>%
summarise_at(c("timing", "nb.preds", "AUC"), mean)
```
### Visualise results
Finally, let's use package {ggplot2}, also part of the tidyverse, to visualize results.
```{r}
# Wrapper to use my own theme
myggplot <- function(...) ggplot(...) + bigstatsr::theme_bigstatsr()
```
```{r}
myggplot(results) +
geom_boxplot(aes(method, timing)) +
scale_y_continuous(breaks = seq(0, 20, 2)) +
facet_grid(M ~ dist) +
ggtitle("Timings")
```
```{r}
myggplot(results) +
geom_hline(yintercept = 0.5, color = "red", linetype = 2) +
geom_boxplot(aes(method, AUC)) +
scale_y_continuous(breaks = 0:10 / 10) +
facet_grid(M ~ dist) +
ggtitle("Predictive performance")
```
## Conclusion
The main point of using tibbles to store results is that you have the inputs, raw outputs and transformed outputs in the same data structure. Moreover, this data structure contains information that you can easily manipulate and visualize. Finally, note that we did not use a single loop in the whole method comparison.
The tidyverse set of packages is very convenient and consistent. You can learn more by reading [book R for Data Science](http://r4ds.had.co.nz/), freely available online.