---
title: "cBioPortal: Medulloblastoma"
author: "Tim Triche, Jr."
date: "12/8/2021"
output:
  html_document:
    keep_md: true
    toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Automating cBioPortal queries
## An unpackaged Rmarkdown example
This document pulls some data from [cBioPortal](https://cbioportal.org/) via `cgdsr`,
and performs a bit of analysis on it. (A separate example exists for slides,
as with a 5-minute presentation.) Let's begin by looking at medulloblastoma,
a mostly-pediatric tumor of the medulla oblongata, or lower brainstem. We will
pull a small study from [cBioPortal](https://cBioPortal.org) and then see how the
estimate of co-mutation odds for two genes holds up in a much larger study.
## Medulloblastoma background information
[A recent paper from Volker Hovestadt](https://www.nature.com/articles/s41586-019-1434-6)
provides some more details on the features of these tumors, which have (like most
childhood brain cancers) been an object of intense study (including at VAI). Of
note, Hedgehog signaling, Wnt signaling, and a constellation of alterations lumped
together via DNA methylation profiling as "Group 3" and "Group 4" define the WHO
subtypes of medulloblastoma among patients thus far characterized.
## cBioPortal data via the `cgdsr` package
Let's pull some data from cBioPortal using the [cgdsr](https://cran.r-project.org/web/packages/cgdsr/index.html) package and
see what studies are available for this disease.
The first step is to create a CGDS (Cancer Genome Data Server)
object to manage cBioPortal queries.
<details>
<summary>Load required libraries</summary>
```{r, loadLibraries}
library(cgdsr)
library(tidyverse)
library(kableExtra)
library(pheatmap)
library(janitor)
```
</details>
```{r, cgdsr}
mycgds <- CGDS("http://www.cbioportal.org/")
show(mycgds)
```
That's not terribly useful. Let's ask the cBioPortal object what studies we can query.
```{r, studies}
studies <- getCancerStudies(mycgds)
glimpse(studies)
```
Much better. It looks like there are 333 studies to choose from at the moment.
## Fetching medulloblastoma studies
Let's narrow our field down to the ones that involve medulloblastoma. When we
loaded the `tidyverse` package, it also loaded `stringr`, which gives us a way
to filter data frames or tibbles based on whether a column matches a string.
```{r, medulloStudies}
getCancerStudies(mycgds) %>%
filter(str_detect(name, "Medulloblastoma")) %>%
select(cancer_study_id, name) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
It would be nice to find out how many samples are in each study. It turns out that
this can be found in the `description` column via a bit of chicanery (i.e., extracting
the first run of digits from the description). No guarantees that this will
always work for other studies -- use it at your own risk! As a bonus, let's order
the studies from smallest to largest.
```{r, studySizes}
getCancerStudies(mycgds) %>%
filter(str_detect(name, "Medulloblastoma")) %>%
mutate(n = as.integer(str_extract(description, "[0-9]+"))) %>%
select(cancer_study_id, n, name) %>%
arrange(n) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
This is encouraging. The Hovestadt paper I mentioned above (where the DKFZ folks
looked at single-cell data from human and mouse) is a follow-up to one with 491
cases! (Most of the time, the Germans have the most samples. The big exception
to this is TARGET liquid tumors: we take pride in dwarfing their cohorts.)
## Fetching case lists to collate mutations and other aberrations
For the sake of simplicity, let's start with the Pediatric Cancer Genome Project (PCGP)
medulloblastoma study and its 37 subjects. `getCaseLists(CGDS, cancerStudy)` retrieves the available case lists for a given study:
```{r, getMutationData}
mbl_study <- "mbl_pcgp"
# the IDs of the cases in the PCGP MBL study
getCaseLists(mycgds, cancerStudy=mbl_study) %>%
filter(case_list_name == "Samples with mutation data") ->
mbl_caselists
# grab the list of lesions
getGeneticProfiles(mycgds, mbl_study) %>%
filter(genetic_profile_name == "Mutations") %>%
pull(genetic_profile_id) ->
mbl_mutations_profile
# a few "greatest hits" mutations seen in MBL
mbl_genes <- c("CTNNB1", "DDX3X", "FAP", "SF3B1", "IFIT3", "TP53")
# get the mutations data and tidy it up
get_muts <- function(x, genes, ...) {
  muts <- getProfileData(x, genes, ...)
  is.na(muts) <- (muts == "NaN")  # "NaN" means no mutation was recorded for that gene/sample
  muts[is.na(muts)] <- 0          # no recorded mutation -> 0
  muts[muts != 0] <- 1            # any recorded mutation -> 1
  rn <- rownames(muts)
  muts <- data.frame(lapply(muts, as.integer))
  rownames(muts) <- rn            # restore the sample IDs dropped by data.frame()
  return(muts[, genes])
}
# We throw out the nature of the mutation here, which is rarely a wise idea (if ever)
muts <- get_muts(mycgds,
mbl_genes,
geneticProfiles=mbl_mutations_profile,
caseList=mbl_caselists$case_list_id)
muts %>%
filter(rowSums(.) > 0) %>%
t() %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
That's rather a lot to digest even if we do toss out the cases without detected
mutations. Perhaps we'd be well served by plotting the data instead.
## Plotting (some of) the results from our query
Let's plot a sort-of-oncoprint of this study:
```{r, oncoprint, echo=FALSE}
# the cgdsr plot function sucks
# let's use pheatmap instead
pheatmap(t(data.matrix(muts)), col=c("white", "darkred"), cluster_rows=FALSE,
clustering_distance_cols="manhattan", clustering_method="ward.D2", legend=FALSE)
```
This is rather informative, in that one can often pick up patterns of mutual
exclusivity between groups of subjects and groups of genes. In the case of
medulloblastoma, the WHO subgroups of the tumor are _Wnt_ (primarily driven by
beta-catenin, aka _CTNNB1_, or _APC_ mutations), _Shh_ (primarily driven by _PTCH1_
or sometimes structural variants), _Group 3_, and _Group 4_, where the latter don't
have one defining class of driver mutation. We note that in this small 2012 study,
it is hard to tell which of the latter is which (later resolved by DNA methylation).
(The WHO classification of medulloblastoma came about in 2016, for what it's worth.)
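To make "patterns of mutual exclusivity" a bit more concrete, here is a small sketch of my own (the names `genes_with_muts` and `pair_results` are mine, not from any package): run Fisher's exact test on every pair of genes in the `muts` matrix and collect the odds ratios and p-values. Odds ratios well below 1 suggest mutual exclusivity; odds ratios well above 1 suggest co-occurrence.
```{r, pairwiseFisherSketch}
# screen every pair of mutated genes for co-occurrence / exclusivity;
# genes with no mutations at all are skipped so each 2x2 table is well formed
genes_with_muts <- names(which(colSums(muts) > 0))
gene_pairs <- combn(genes_with_muts, 2, simplify = FALSE)
pair_results <- do.call(rbind, lapply(gene_pairs, function(p) {
  ft <- fisher.test(table(muts[[p[1]]], muts[[p[2]]]))
  data.frame(gene1 = p[1], gene2 = p[2],
             odds_ratio = unname(ft$estimate),
             p_value = ft$p.value)
}))
pair_results %>% arrange(p_value)
```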
Based on our earlier plot, we might wonder whether the beta-catenin gene and
the _DDX3X_ X-linked helicase gene are co-mutated. We can test this similarly
to our previous adventures in mutual exclusivity with IDH1, IDH2, TET2, and WT1:
```{r, CTNNB1_and_DDX3X}
message("Chi-squared p-value:", appendLF = FALSE)
muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
message("Fisher's exact test p-value:", appendLF = FALSE)
muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
```
# Power, sample size, and variance
## More is not better, better is better...
I once spent a summer in Italy, eating much smaller meals but still feeling full.
Italians (or the Genovese, at least) are fond of reminding Americans that _di più non è meglio_;
_meglio è meglio_. ("More is not better; better is better.") And from a statistical
standpoint, this is usually true: a small, well-executed survey or experiment can
beat the pants off of a huge, poorly-executed survey or experiment. (Xiao-Li Meng
at Harvard has dubbed this _the curse of Big Data_, although it's really more of
a curse of crappy and/or perversely incentivized experiments). Nevertheless...
## ...but more of the same is often good enough
If we *can* get *more* observations and at the same time *better* observations,
it would be rather silly not to. Doubly so when it's free! So let's do that.
Our earlier observations of co-operating mutations seemed _highly significant_!
And in fact they're striking in the figure (see the importance of figures?).
But will they replicate when we look at the same genes in a bigger study?
(Don't peek at the Northcott paper if you can help it. It's more fun this way.)
```{r, Wnt_and_Ddx_revisited}
northcott <- "mbl_dkfz_2017"
getCaseLists(mycgds, northcott) %>%
filter(case_list_name == "Samples with mutation data") ->
northcott_caselists
# grab the list of lesions
getGeneticProfiles(mycgds, northcott) %>%
filter(genetic_profile_name == "Mutations") %>%
pull(genetic_profile_id) ->
northcott_mutations_profile
# grab the mutation matrix for the genes as before
northcott_muts <- get_muts(mycgds, # as before
mbl_genes, # as before
northcott_mutations_profile,
northcott_caselists$case_list_id)
# out of curiosity, how many of each mutation do we see here?
colSums(northcott_muts)
```
Looks like we're all set to replicate our previous results.
```{r, replicatingResults}
message("Chi-squared p-value (Northcott):", appendLF = FALSE)
northcott_muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
message("Fisher's exact test p-value (Northcott):", appendLF = FALSE)
northcott_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
```
Wow! Looks like we've replicated our original finding!
I'll bet this will look super cool when we plot it.
```{r, anotherPlot}
# let's use pheatmap again
pheatmap(t(data.matrix(northcott_muts)), col=c("white", "darkred"), cluster_rows=FALSE,
clustering_distance_cols="manhattan", clustering_method="ward.D2", legend=FALSE)
```
Well, that looks a bit different, doesn't it?
They're there, but maybe not overlapping as much as before.
Say... are some of these samples also in the original PCGP study?
```{r, doubleDipping}
northcott_caselists %>%
pull(case_ids) %>%
str_split(pattern=" ") %>%
getElement(1) ->
northcott_cases
mbl_caselists %>%
pull(case_ids) %>%
str_split(pattern=" ") %>%
getElement(1) ->
stjude_cases
intersect(northcott_cases, stjude_cases)
```
Whoops. Better drop those for the replication, eh?
```{r, dropCases}
northcott_only <- setdiff(northcott_cases, stjude_cases)
northonly_muts <- get_muts(mycgds, # as before
mbl_genes, # as before
geneticProfiles=northcott_mutations_profile,
cases=northcott_only)
```
(This is a recurrent issue with cBioPortal studies -- you just have to check.)
Is the association still statistically significant?
```{r, significance}
message("Chi-squared p-value (Northcott ONLY):", appendLF = FALSE)
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
message("Fisher's exact test p-value (Northcott ONLY):", appendLF = FALSE)
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
```
Yes it is. But is the odds ratio for co-occurrence in the same ballpark?
```{r, oddsRatioReplication}
muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test() %>%
getElement("estimate") ->
StJ_estimate
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test() %>%
getElement("estimate") ->
northcott_estimate
message("Co-occurrence odds ratio (St. Jude cases): ", round(StJ_estimate, 3))
message("Co-occurrence odds ratio (DKFZ cases): ", round(northcott_estimate, 3))
message("Effect size inflation, St. Jude vs. Northcott: ",
round(StJ_estimate / northcott_estimate), "x")
```
The estimate from the St. Jude's study is *ten times* as large as from Northcott's.
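One quick way to see how compatible those two numbers really are (a small sketch reusing the objects above; nothing is computed beyond what `fisher.test` already returns) is to look at the 95% confidence intervals alongside the point estimates:
```{r, oddsRatioCIs}
# 95% confidence intervals for the co-occurrence odds ratio in each cohort;
# wide (and overlapping) intervals hint that the point estimates are unstable
muts %>%
  tabyl(CTNNB1, DDX3X) %>%
  fisher.test() %>%
  getElement("conf.int")
northonly_muts %>%
  tabyl(CTNNB1, DDX3X) %>%
  fisher.test() %>%
  getElement("conf.int")
```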
# The winner's curse and replication
It turns out this happens a lot. It's rarely intentional (see above).
Early results in small studies can over- or under-estimate effect sizes
(and, sometimes, significance or sign) relative to larger or later studies.
(Unfortunately, the later studies almost invariably do not make the news.)
Recent work has quantified [the many challenges of replicating scientific research](https://elifesciences.org/articles/67995), particularly
[the difficulty of interpreting the results](https://elifesciences.org/articles/71601)
when all is said and done. In short, just over a quarter of the high-profile cancer
biology experiments that the authors tried to replicate could even be attempted. Of
those that could be carried out, about half replicated the original findings.
Already in this class, we've seen at least one result that _could not possibly have
been interpreted as significant_ (because it was literal experimental noise) cited
over 1000 times as justification for continuing to parrot the same nonsense.
Usually the problems are a bit more subtle, and sometimes they're just bad luck:
more specifically, regression to the mean.
## Regression to the mean
[Regression to the mean](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/)
describes the tendency of successive estimates, particularly from small samples,
to over- and under-shoot the true effect sizes. This cuts both ways: usually in
biology and medicine we worry about overestimates of effect sizes, but smallish
experiments can also underestimate important effects. Depending on incentives,
one or the other may be more desirable than a consistent estimate of modest size.
This is not limited to experimental biology; it can be readily seen in
[the gold standard in biomedical research, the randomized clinical trial](https://www.bmj.com/content/346/bmj.f2304). In short, if you run
underpowered experiments, most of the time you'll miss an effect even if it's there,
but sometimes you'll wildly overestimate it. Neither of these is a good thing.
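To see that claim in action, here is a small, self-contained simulation of my own (the mutation frequencies of 15% and 30% and the group size of 20 are made-up illustrative numbers, not taken from the medulloblastoma data): most small studies miss the real difference, and the ones that do reach significance tend to overestimate it.
```{r, winnersCurseSketch}
# two groups with a real but modest difference in mutation frequency;
# small studies usually miss it, and the "hits" overestimate the difference
set.seed(1234)
p1 <- 0.15   # true mutation frequency in group 1 (assumed for illustration)
p2 <- 0.30   # true mutation frequency in group 2 (assumed for illustration)
small_study <- function(n) {
  x1 <- rbinom(1, n, p1)
  x2 <- rbinom(1, n, p2)
  ft <- fisher.test(matrix(c(x1, n - x1, x2, n - x2), nrow = 2))
  c(p = ft$p.value, diff = (x2 - x1) / n)
}
sims <- replicate(1000, small_study(n = 20))
hits <- sims[, sims["p", ] < 0.05, drop = FALSE]
message("Power at n = 20 per group: ", round(100 * ncol(hits) / ncol(sims)), "%")
message("True difference in proportions: ", p2 - p1)
message("Mean estimated difference among significant results: ",
        round(mean(hits["diff", ]), 3))
```
The surviving "significant" estimates should come out noticeably larger, on average, than the true difference of 0.15: the winner's curse in miniature.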
# Simulations, power, and reducing the variance of estimates
Let's make this concrete with some simulations. We'll adjust the effect size
slightly for co-mutations of _CTNNB1_ and _DDX3X_ in medulloblastoma, then run
some simulations at various sample sizes to see what we see. (You could also get
an analytical estimate via `power.prop.test` and friends, but for better or worse,
simulating from a noisy generating process is about the same amount of work for
powering Fisher's exact test, as it is for many other tests of significance, trials included.)
In order to take uncertainty into account (supposing that we found what we found
in the original PCGP study and want to estimate the odds of seeing it again),
we'll use the [beta distribution](http://varianceexplained.org/statistics/beta_distribution_and_baseball/) to capture a "noisy" estimate of each proportion. Specifically, let's
count how many PCGP cases fall into each cell of the original mutation table.
```{r, probabilities}
neither <- nrow(subset(muts, CTNNB1 == 0 & DDX3X == 0))
CTNNB1 <- nrow(subset(muts, CTNNB1 == 1 & DDX3X == 0))
DDX3X <- nrow(subset(muts, CTNNB1 == 0 & DDX3X == 1))
both <- nrow(subset(muts, CTNNB1 == 1 & DDX3X == 1))
```
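Before simulating, it may help to see those counts laid out as the 2x2 table we're modeling (a small sketch of my own; the layout matches the tables the simulator below will produce, with CTNNB1 status in rows and DDX3X status in columns):
```{r, pcgpTable}
# the observed PCGP 2x2 table and its empirical cell proportions
pcgp_table <- matrix(c(neither, CTNNB1, DDX3X, both), nrow = 2,
                     dimnames = list(CTNNB1 = c("0", "1"), DDX3X = c("0", "1")))
pcgp_table
round(pcgp_table / sum(pcgp_table), 3)
```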
Now we have all we really need to simulate. Formally, we will model it like so.
* For each sample, we simulate whether it carries exactly _one_ mutation (as opposed to both or neither).
* If a sample has exactly _one_ mutation, we simulate which one (CTNNB1 or DDX3X).
* If a sample does not have exactly _one_ mutation, we simulate whether it has both or neither.
We can do this repeatedly to estimate the distribution of test
statistics to expect if we run this experiment quite a few times,
with both smaller and larger total sample sizes. We're assuming that
the dependency structure is fairly stable (is this reasonable?).
The `Beta(a, b)` distribution above is continuous between 0 and 1, and its shape depends
on the values of `a` and `b`. For example, we can plot each of the above using the
St. Jude-derived values to get a feel for how "mushy" our guesses are given
the number of samples in the St. Jude PCGP study. Effectively, we propagate our
underlying uncertainty about parameters by drawing them from a sensible generator,
and that sensible generator is a beta distribution reflecting our sample size.
```{r, betas}
a <- CTNNB1
b <- DDX3X
p_one <- function(x) dbeta(x, (a + b), (both + neither))
p_both <- function(x) dbeta(x, both, (a + b + neither))
p_both_if_not_one <- function(x) dbeta(x, both, neither)
plot(p_one, main="Pr(A|B & !(A & B))")
plot(p_both, main="Pr(A & B)")
plot(p_both_if_not_one, main="Pr( (A & B) | (A + B != 1))")
```
Now for `n` samples, we can simulate appropriately "noisy" 2x2 tables with that many subjects.
```{r, simulate2x2}
sim2x2 <- function(n, neither, a, b, both) {
  p_one <- rbeta(1, (a + b), (both + neither))  # Pr(exactly one of A, B mutated)
  p_both <- rbeta(1, both, neither)             # Pr(both | not exactly one)
  p_a <- rbeta(1, a, b)                         # Pr(A | exactly one)
  n_a_b <- rbinom(1, n, p_one)                  # samples with exactly one mutation
  n_neither_both <- n - n_a_b                   # samples with both or neither
  n_both <- rbinom(1, n_neither_both, p_both)
  n_neither <- n_neither_both - n_both
  n_a <- rbinom(1, n_a_b, p_a)
  n_b <- n_a_b - n_a
  as.table(matrix(c(n_neither, n_a, n_b, n_both), nrow=2))
}
```
Let's give it a shot.
```{r, someSims}
a <- CTNNB1
b <- DDX3X
sim2x2(n=nrow(muts), neither, a, b, both)
```
That seems to work fine. (We could also simulate the four cell probabilities more directly,
rather than in stages; one possible sketch follows.) If you want an analytical estimate for an asymptotic
test (`prop.test`), R also provides that, but beware: it doesn't really take into account
sampling variance (i.e., uncertainty about the parameter estimates). Let's do that ourselves.
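As an aside, here is one way that more direct simulation could look (a sketch of my own, not from the original analysis: `sim2x2_dirichlet` draws the four cell probabilities jointly from a Dirichlet posterior with a flat prior, built here from normalized gamma draws so no extra package is needed):
```{r, dirichletSketch}
# draw the four cell probabilities jointly from a Dirichlet posterior,
# then draw the whole 2x2 table in a single multinomial step
sim2x2_dirichlet <- function(n, neither, a, b, both, prior = 1) {
  counts <- c(neither, a, b, both) + prior       # flat Dirichlet(1,1,1,1) prior
  probs <- rgamma(4, shape = counts)             # gamma draws...
  probs <- probs / sum(probs)                    # ...normalized = one Dirichlet draw
  cells <- rmultinom(1, size = n, prob = probs)  # one table's worth of samples
  as.table(matrix(cells, nrow = 2))              # same layout as sim2x2 above
}
sim2x2_dirichlet(n = nrow(muts), neither, a, b, both)
```
Either way, the point is the same: we propagate the sampling variance of the PCGP counts into every simulated table. With that aside out of the way, let's wrap our original simulator so we can estimate power.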
```{r, wrappers}
# fairly generic
simFisher <- function(n, neither, a, b, both) fisher.test(sim2x2(n, neither, a, b, both))
# using the values we've already set up to simulate from:
simFetP <- function(n) simFisher(n, neither, a, b, both)$p.value
```
The `replicate` function is quite helpful here. Let's suppose the St. Jude
study is representative of medulloblastoma generally. We'll simulate 1000
studies of sizes between 10 and 500 to see how often our (true!) difference
in proportions registers as significant at p < 0.05.
```{r, sampleSizes}
powerN <- function(n, alpha=0.05) {
  res <- table(replicate(1000, simFetP(n=n)) < alpha)
  res["TRUE"] / sum(res)
}
for (N in c(10, 30, 50, 100, 300, 500)) {
  message("Power at alpha = 0.05 with n = ", N, ": ", powerN(N) * 100, "%")
}
```
_Question_: What's the estimated power with a sample size of 37?
<details>
<summary>Click here for an answer</summary>
```{r, power37}
message("Power at alpha = 0.05 with n = ", 37, ": ", powerN(37) * 100, "%")
```
</details>
_Question_: How does that compare to `power.prop.test` with p1 = (neither+both)/(all),
p2 = (CTNNB1only + DDX3Xonly)/(all), and n=37? Does this seem like an over- or under-estimate
relative to Fisher's exact test (above)?
<details>
<summary>Click here for an answer</summary>
```{r, powerPropTest}
power.prop.test(n=37, p1=(35/37), p2=(2/37))
```
</details>
For reasons we'll see shortly, I find the above to be a gross overestimate.
What about our odds ratio estimates? Do they hop around all over the place?
Let's add a pseudocount to the odds ratio estimator to shrink and stabilize the estimates.
```{r, oddsRatios}
# how wild are our odds ratios at a given N?
shrinkOR <- function(n, pseudo=2) {
  res <- sim2x2(n, neither, a, b, both) + pseudo
  odds <- (res[1,1] * res[2,2]) / (res[1,2] * res[2,1])
  return(odds)
}
OR0 <- function(n) replicate(1000, shrinkOR(n, pseudo=1e-6))
for (N in c(10, 20, 40, 80)) {
  hist(OR0(n=N), xlab="Estimate", main=paste("Near-raw odds ratio distribution with N =", N))
}
# And if we shrink a bit (i.e., apply a prior)?
ORs <- function(n) replicate(1000, shrinkOR(n))
for (N in c(10, 20, 40, 80)) {
  hist(ORs(n=N), xlab="Estimate", main=paste("Shrunken odds ratio distribution with N =", N))
}
```
(Some of you may already know the log-odds ratio trick to turn the above into a bell curve.
If not, try re-plotting the histograms above, but using the log of the OR estimates.)
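If you would rather not wait, the change is a single line (this reuses the `ORs` simulator defined above; since the default pseudocount keeps every cell positive, the log is always finite):
```{r, logOddsRatios}
# the same shrunken simulations, plotted on the log-odds-ratio scale
for (N in c(10, 20, 40, 80)) {
  hist(log(ORs(n=N)), xlab="log(OR) estimate",
       main=paste("Shrunken log odds ratio distribution with N =", N))
}
```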
The tails of the estimates are pretty long, but the mass concentrates quickly near the true
parameter estimate. (This is also why resampling approaches can help stabilize estimates:
if you have enough data to estimate a parameter, resampling can also estimate how fragile
your estimates are, and therefore how trustworthy they are. That's why we bootstrap.)
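To make the bootstrap remark concrete, here is a minimal sketch of my own (operating on the `muts` data frame from the PCGP study, with the same pseudocount of 2 used by `shrinkOR` above):
```{r, bootstrapORSketch}
# resample patients with replacement and re-estimate the (lightly shrunken)
# co-occurrence odds ratio each time; the spread of these bootstrap estimates
# is a rough gauge of how fragile the original point estimate is
set.seed(2021)
boot_or <- replicate(1000, {
  resampled <- muts[sample(nrow(muts), replace = TRUE), ]
  tab <- table(factor(resampled$CTNNB1, levels = 0:1),
               factor(resampled$DDX3X, levels = 0:1)) + 2  # pseudocount, as above
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
})
quantile(boot_or, c(0.025, 0.5, 0.975))
```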
# Thoughts and questions
* Run `with(muts, table(CTNNB1, DDX3X)) %>% fisher.test`. What does the 95% CI represent?
* Fisher's exact test is a special case of hypergeometric testing. What others exist?
What are these tests typically used for, and what assumptions are made regarding categories?
* What happens if you test multiple hypotheses, or apply multiple tests to the same hypothesis?
* Does it matter which correction method in `p.adjust` you use to correct if you do?
* Is there a situation in cancer genetic epidemiology where the FWER would make more
sense than the FDR (i.e., where you want to bound the probability of any false positives)?
* Can you rig up a simulation where the cost for a false negative is much greater than for
a false positive, and tally up the results of various `p.adjust` methods in this situation?
* Can you rig up a simulation where the cost for a false positive is much greater than for
a false negative, and tally up the results of various `p.adjust` schemes for that?
* (Nontrivial) How would you adjust this for a 2x3 or 3x2 table like in Wang et al. 2014?
* (Nontrivial) How does the addition of a pseudocount stabilize the variance of estimates? What is being traded away when we do this, and is it (typically) a worthwhile trade?