cBioPortal: Medulloblastoma
Tim Triche, Jr.
12/8/2021
This document pulls some data from cBioPortal via the cgdsr package and performs a bit of analysis on it. (A separate example exists for slides, such as a 5-minute presentation.) Let's begin by looking at medulloblastoma, a mostly-pediatric tumor of the medulla oblongata, or lower brainstem. We will pull a small study from cBioPortal and then see how the estimate of co-mutation odds for two genes holds up in a much larger study.
A recent paper from Volker Hovestadt provides some more details on the features of these tumors, which have (like most childhood brain cancers) been an object of intense study (including at VAI). Of note, Hedgehog signaling, Wnt signaling, and a constellation of alterations lumped together via DNA methylation profiling as "Group 3" and "Group 4" define the WHO subtypes of medulloblastoma among patients thus far characterized.
Let's pull some data from cBioPortal using the cgdsr package and see what studies are available for this disease.
The first step is to create a CGDS (Cancer Genome Data Server) object to manage cBioPortal queries.
Load required libraries
library(cgdsr)
## Please send questions to [email protected]
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(pheatmap)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
mycgds <- CGDS("http://www.cbioportal.org/")
show(mycgds)
## [1] "CGDS: 0x55be020a1528"
That's not terribly useful. Let's ask the cBioPortal object what studies we can query.
studies <- getCancerStudies(mycgds)
glimpse(studies)
## Rows: 333
## Columns: 3
## $ cancer_study_id <chr> "paac_jhu_2014", "mel_tsam_liang_2017", "all_stjude_20…
## $ name <chr> "Acinar Cell Carcinoma of the Pancreas (JHU, J Pathol …
## $ description <chr> "Whole exome sequencing of 23 surgically resected panc…
Much better. It looks like there are 333 studies to choose from at the moment.
Let's narrow our field down to the ones that involve medulloblastoma. When we loaded the tidyverse package, it also loaded stringr, which gives us a way to filter data frames or tibbles based on whether a column matches a string.
getCancerStudies(mycgds) %>%
filter(str_detect(name, "Medulloblastoma")) %>%
select(cancer_study_id, name) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
cancer_study_id | name |
---|---|
mbl_broad_2012 | Medulloblastoma (Broad, Nature 2012) |
mbl_dkfz_2017 | Medulloblastoma (DKFZ, Nature 2017) |
mbl_icgc | Medulloblastoma (ICGC, Nature 2012) |
mbl_pcgp | Medulloblastoma (PCGP, Nature 2012) |
mbl_sickkids_2016 | Medulloblastoma (Sickkids, Nature 2016) |
It would be nice to find out how many samples are in each study. It turns out that this can be found in the description column via a bit of chicanery (i.e., throwing out anything in the description that isn't a number). No guarantees that this will always work for other studies -- use this at your own risk! As a bonus, let's order the studies from smallest to largest.
getCancerStudies(mycgds) %>%
filter(str_detect(name, "Medulloblastoma")) %>%
mutate(n = as.integer(str_extract(description, "[0-9]+"))) %>%
select(cancer_study_id, n, name) %>%
arrange(n) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
cancer_study_id | n | name |
---|---|---|
mbl_pcgp | 37 | Medulloblastoma (PCGP, Nature 2012) |
mbl_sickkids_2016 | 46 | Medulloblastoma (Sickkids, Nature 2016) |
mbl_broad_2012 | 92 | Medulloblastoma (Broad, Nature 2012) |
mbl_icgc | 125 | Medulloblastoma (ICGC, Nature 2012) |
mbl_dkfz_2017 | 491 | Medulloblastoma (DKFZ, Nature 2017) |
This is encouraging. The Hovestadt paper I mentioned above (where the DKFZ folks looked at single-cell data from human and mouse) is a follow-up to one with 491 cases! (Most of the time, the Germans have the most samples. The big exception to this is TARGET liquid tumors: we take pride in dwarfing their cohorts.)
For the sake of simplicity, let's start with the Pediatric Cancer Genome Project (PCGP) medulloblastoma study and its 37 subjects. getCaseLists(CGDS, cancerStudy) retrieves the case lists for a given study:
mbl_study <- "mbl_pcgp"
# the IDs of the cases in the PCGP MBL study
getCaseLists(mycgds, cancerStudy=mbl_study) %>%
filter(case_list_name == "Samples with mutation data") ->
mbl_caselists
# grab the list of lesions
getGeneticProfiles(mycgds, mbl_study) %>%
filter(genetic_profile_name == "Mutations") %>%
pull(genetic_profile_id) ->
mbl_mutations_profile
# a few "greatest hits" mutations seen in MBL
mbl_genes <- c("CTNNB1", "DDX3X", "FAP", "SF3B1", "IFIT3", "TP53")
# get the mutations data and tidy it up
get_muts <- function(x, genes, ...) {
  muts <- getProfileData(x, genes, ...)
  is.na(muts) <- (muts == "NaN")  # "NaN" marks samples with no recorded mutation
  muts[is.na(muts)] <- 0          # no mutation -> 0
  muts[muts != 0] <- 1            # any recorded mutation -> 1
  rn <- rownames(muts)
  muts <- data.frame(lapply(muts, as.integer))
  rownames(muts) <- rn
  return(muts[, genes])
}
# We throw out the nature of the mutation here, which is rarely a wise idea (if ever)
muts <- get_muts(mycgds,
mbl_genes,
geneticProfiles=mbl_mutations_profile,
caseList=mbl_caselists$case_list_id)
muts %>%
filter(rowSums(.) > 0) %>%
t() %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
 | SJMB041 | SJMB001 | SJMB028 | SJMB040 | SJMB004 | SJMB033 | SJMB027 | SJMB014 | SJMB015 | SJMB026 | SJMB029 |
---|---|---|---|---|---|---|---|---|---|---|---|
CTNNB1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DDX3X | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
FAP | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
SF3B1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
IFIT3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
TP53 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
That's rather a lot to digest even if we do toss out the cases without detected mutations. Perhaps we'd be well served by plotting the data instead.
You can also embed plots; for example, here is a sort-of-oncoprint of this study:
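The chunk that produced that figure is not echoed in this output; a minimal sketch along the same lines, assuming the muts data frame from above and mirroring the pheatmap call used for the larger study further down, might look like this:
# a sketch of the sort-of-oncoprint: one row per gene, one column per sample
pheatmap(t(data.matrix(muts)), col=c("white", "darkred"), cluster_rows=FALSE,
         clustering_distance_cols="manhattan", clustering_method="ward.D2", legend=FALSE)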
This is rather informative, in that one can often pick up patterns of mutual exclusivity between groups of subjects and groups of genes. In the case of medulloblastoma, the WHO subgroups of the tumor are Wnt (primarily driven by beta-catenin, aka CTNNB1, or APC mutations), Shh (primarily driven by PTCH1 or sometimes structural variants), Group 3, and Group 4, where the latter don't have one defining class of driver mutation. We note that in this small 2012 study, it is hard to tell which of the latter is which (later resolved by DNA methylation). (The WHO classification of medulloblastoma came about in 2016, for what it's worth.)
Based on our earlier plot, we might wonder whether the beta-catenin gene and the DDX3X X-linked helicase gene are co-mutated. We can test this similarly to our previous adventures in mutual exclusivity with IDH1, IDH2, TET2, and WT1:
message("Chi-squared p-value:", appendLF = FALSE)
## Chi-squared p-value:
muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 0.002498751
message("Fisher's exact test p-value:", appendLF = FALSE)
## Fisher's exact test p-value:
muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 0.002013778
I once spent a summer in Italy, eating much smaller meals but still feeling full. Italians (or Genovese, at least) are fond of reminding Americans that di più non è meglio; meglio è meglio. ("More is not better; better is better.") And from a statistical standpoint, this is usually true: a small, well-executed survey or experiment can beat the pants off of a huge, poorly-executed survey or experiment. (Xiao-Li Meng at Harvard has dubbed this the curse of Big Data, although it's really more of a curse of crappy and/or perversely incentivized experiments.) Nevertheless...
If we can get more observations and at the same time better observations, it would be rather silly not to. Doubly so when it's free! So let's do that.
Our earlier observations of co-operating mutations seemed highly significant! And in fact they're striking in the figure (see the importance of figures?). But will they replicate when we look at the same genes in a bigger study? (Don't peek at the Northcott paper if you can help it. It's more fun this way.)
northcott <- "mbl_dkfz_2017"
getCaseLists(mycgds, northcott) %>%
filter(case_list_name == "Samples with mutation data") ->
northcott_caselists
# grab the list of lesions
getGeneticProfiles(mycgds, northcott) %>%
filter(genetic_profile_name == "Mutations") %>%
pull(genetic_profile_id) ->
northcott_mutations_profile
# grab the mutation matrix for the genes as before
northcott_muts <- get_muts(mycgds, # as before
mbl_genes, # as before
northcott_mutations_profile,
northcott_caselists$case_list_id)
# out of curiosity, how many of each mutation do we see here?
colSums(northcott_muts)
## CTNNB1 DDX3X FAP SF3B1 IFIT3 TP53
## 32 43 0 5 0 17
Looks like we're all set to replicate our previous results.
message("Chi-squared p-value (Northcott):", appendLF = FALSE)
## Chi-squared p-value (Northcott):
northcott_muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 0.0004997501
message("Fisher's exact test p-value (Northcott):", appendLF = FALSE)
## Fisher's exact test p-value (Northcott):
northcott_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 2.770184e-06
Wow! Looks like we've replicated our original finding! I'll bet this will look super cool when we plot it.
# let's use pheatmap again
pheatmap(t(data.matrix(northcott_muts)), col=c("white", "darkred"), cluster_rows=FALSE,
clustering_distance_cols="manhattan", clustering_method="ward.D2", legend=FALSE)
Well, that looks a bit different, doesn't it? They're there, but maybe not as overlappy as before. Say... are some samples from the original PCGP?
northcott_caselists %>%
pull(case_ids) %>%
str_split(pattern=" ") %>%
getElement(1) ->
northcott_cases
mbl_caselists %>%
pull(case_ids) %>%
str_split(pattern=" ") %>%
getElement(1) ->
stjude_cases
intersect(northcott_cases, stjude_cases)
## [1] "SJMB001" "SJMB002" "SJMB003" "SJMB004" "SJMB006" "SJMB008" "SJMB009"
## [8] "SJMB010" "SJMB011" "SJMB012" "SJMB013" "SJMB014" "SJMB015" "SJMB016"
## [15] "SJMB017" "SJMB018" "SJMB019" "SJMB020" "SJMB023" "SJMB024" "SJMB025"
## [22] "SJMB026" "SJMB027" "SJMB028" "SJMB029" "SJMB030" "SJMB031" "SJMB032"
## [29] "SJMB033" "SJMB034" "SJMB035" "SJMB036" "SJMB037" "SJMB038" "SJMB039"
## [36] "SJMB040" "SJMB041" ""
Whoops. Better drop those for the replication, eh?
northcott_only <- setdiff(northcott_cases, stjude_cases)
northonly_muts <- get_muts(mycgds, # as before
mbl_genes, # as before
geneticProfiles=northcott_mutations_profile,
cases=northcott_only)
(This is a recurrent issue with cBioPortal studies -- you just have to check.) Is the association still statistically significant?
message("Chi-squared p-value (Northcott ONLY):", appendLF = FALSE)
## Chi-squared p-value (Northcott ONLY):
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
chisq.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 0.0009995002
message("Fisher's exact test p-value (Northcott ONLY):", appendLF = FALSE)
## Fisher's exact test p-value (Northcott ONLY):
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test(simulate.p.value = TRUE) %>%
getElement("p.value")
## [1] 0.0002105828
Yes it is. But is the odds ratio for co-occurrence in the same ballpark?
muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test() %>%
getElement("estimate") ->
StJ_estimate
northonly_muts %>%
tabyl(CTNNB1, DDX3X) %>%
fisher.test() %>%
getElement("estimate") ->
northcott_estimate
message("Co-occurrence odds ratio (St. Jude cases): ", round(StJ_estimate, 3))
## Co-occurrence odds ratio (St. Jude cases): 62.548
message("Co-occurrence odds ratio (DKFZ cases): ", round(northcott_estimate, 3))
## Co-occurrence odds ratio (DKFZ cases): 6.207
message("Effect size inflation, St. Jude vs. Northcott: ",
round(StJ_estimate / northcott_estimate), "x")
## Effect size inflation, St. Jude vs. Northcott: 10x
The estimate from the St. Jude study is ten times as large as the one from Northcott's.
It turns out this happens a lot. It's rarely intentional (see above). Early results in small studies can over- or under-estimate effect sizes (and, sometimes, significance or sign) relative to larger or later studies. (Unfortunately, the later studies almost invariably do not make the news.)
Recent work has quantified the many challenges of replicating scientific research, particularly the difficulty of interpreting the results when all is said and done. In short, just over a quarter of the high-profile cancer biology studies that the authors tried to replicate could be started at all. Of those, about half replicated. Already in this class, we've seen at least one result that could not possibly have been interpreted as significant (because it was literal experimental noise) cited over 1000 times as justification for continuing to parrot the same nonsense. Usually the problems are a bit more subtle, and sometimes they're just bad luck. More specifically, regression to the mean.
Regression to the mean describes the tendency of successive estimates, particularly from small samples, to over- and under-shoot the true effect sizes. This cuts both ways: usually in biology and medicine we worry about overestimates of effect sizes, but smallish experiments can also underestimate important effects. Depending on incentives, one or the other may be more desirable than a consistent estimate of modest size.
This is not limited to experimental biology; it can be readily seen in the gold standard of biomedical research, the randomized clinical trial. In short, if you run underpowered experiments, most of the time you'll miss an effect even if it's there, but sometimes you'll wildly overestimate it. Neither of these is a good thing.
Let's make this concrete with some simulations. We'll adjust the effect size slightly for co-mutations of CTNNB1 and DDX3X in medulloblastoma, then run some simulations at various sample sizes to see what we see. (You can also use an analytical estimate via power.prop.test and similar, but for better or worse, simulating from a noisy generating process is about the same amount of work for powering Fisher's test, as is the case for many tests of significance, as in trials.)
In order to take uncertainty into account (supposing that we found what we found in the original PCGP study and want to estimate the odds of seeing it again), we'll use the beta distribution to capture a "noisy" estimate of each proportion involved. Specifically, let's tally the cells of the original mutation table from the PCGP data.
neither <- nrow(subset(muts, CTNNB1 == 0 & DDX3X == 0))
CTNNB1 <- nrow(subset(muts, CTNNB1 == 1 & DDX3X == 0))
DDX3X <- nrow(subset(muts, CTNNB1 == 0 & DDX3X == 1))
both <- nrow(subset(muts, CTNNB1 == 1 & DDX3X == 1))
Now we have all we really need to simulate. Formally, we will model it like so.
- For each sample, we simulate whether it carries exactly one of the two mutations, with probability drawn from Beta(CTNNB1 + DDX3X, both + neither).
- If a sample has exactly one mutation, we simulate which one it is, with probability drawn from Beta(CTNNB1, DDX3X).
- If a sample has fewer or more than one mutation, we simulate whether it has both or neither, with probability drawn from Beta(both, neither).
We can do this repeatedly to estimate the distribution of test statistics to expect if we run this experiment quite a few times, with both smaller and larger total sample sizes. We're assuming that the dependency structure is fairly stable (is this reasonable?).
The Beta(a, b) distribution above is continuous between 0 and 1, and its shape depends on the values of a and b. For example, we can plot each of the above using the St. Jude-derived values to get a feel for how "mushy" our guesses are given the number of samples in the St. Jude PCGP study. Effectively, we propagate our underlying uncertainty about parameters by drawing them from a sensible generator, and that sensible generator is a beta distribution reflecting our sample size.
a <- CTNNB1
b <- DDX3X
p_one <- function(x) dbeta(x, (a + b), (both + neither))
p_both <- function(x) dbeta(x, both, (a + b + neither))
p_both_if_not_one <- function(x) dbeta(x, both, neither)
plot(p_one, main="Pr(A|B & !(A & B))")
plot(p_both, main="Pr(A & B)")
plot(p_both_if_not_one, main="Pr( (A & B) | (A + B != 1))")
Now, for n samples, we can simulate appropriately "noisy" 2x2 tables with that many subjects.
sim2x2 <- function(n, neither, a, b, both) {
  # Pr(exactly one of the two mutations), with Beta-distributed uncertainty
  p_one <- rbeta(1, (a + b), (both + neither))
  # Pr(both mutations | not exactly one)
  p_both <- rbeta(1, both, neither)
  # Pr(the single mutation is CTNNB1 | exactly one)
  p_a <- rbeta(1, a, b)
  n_a_b <- rbinom(1, n, p_one)                 # samples with exactly one mutation
  n_neither_both <- n - n_a_b                  # samples with neither or both
  n_both <- rbinom(1, n_neither_both, p_both)
  n_neither <- n_neither_both - n_both
  n_a <- rbinom(1, n_a_b, p_a)                 # single-mutant samples that are CTNNB1
  n_b <- n_a_b - n_a
  as.table(matrix(c(n_neither, n_a, n_b, n_both), nrow=2))
}
Let's give it a shot.
a <- CTNNB1
b <- DDX3X
sim2x2(n=nrow(muts), neither, a, b, both)
## A B
## A 29 2
## B 2 4
That seems to work fine (we could more directly simulate the odds of neither+both and a+b; feel free to implement that instead). If you want an analytical estimate for an asymptotic test (prop.test), R also provides that, but beware: it doesn't really take into account sampling variance (i.e., uncertainty about the parameter estimates). Let's do that ourselves.
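Before we do, as an aside, here is a minimal sketch of that asymptotic check on the PCGP counts; it assumes the neither/CTNNB1/DDX3X/both tallies from above (where CTNNB1 and DDX3X hold the single-mutant counts) and is purely illustrative:
# asymptotic comparison of P(DDX3X mutant | CTNNB1 mutant) vs. P(DDX3X mutant | CTNNB1 wild-type);
# with counts this small, expect a warning that the chi-squared approximation may be unreliable
prop.test(x = c(both, DDX3X), n = c(both + CTNNB1, DDX3X + neither))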
# fairly generic
simFisher <- function(n, neither, a, b, both) fisher.test(sim2x2(n, neither, a, b, both))
# using the values we've already set up to simulate from:
simFetP <- function(n) simFisher(n, neither, a, b, both)$p.value
The replicate function is quite helpful here. Let's suppose the St. Jude study is representative of medulloblastoma generally. We'll simulate 1000 studies at each of several sample sizes between 10 and 500 to see how often our (true!) difference in proportions registers as significant at p < 0.05.
powerN <- function(n, alpha=0.05) {
res <- table(replicate(1000, simFetP(n=n)) < alpha)
res["TRUE"] / sum(res)
}
for (N in c(10, 30, 50, 100, 300, 500)) {
message("Power at alpha = 0.05 with n = ", N, ": ", powerN(N) * 100, "%")
}
## Power at alpha = 0.05 with n = 10: 16%
## Power at alpha = 0.05 with n = 30: 67.3%
## Power at alpha = 0.05 with n = 50: 83.9%
## Power at alpha = 0.05 with n = 100: 94.4%
## Power at alpha = 0.05 with n = 300: 99.4%
## Power at alpha = 0.05 with n = 500: 99.4%
Question: What's the estimated power with a sample size of 37?
Click here for an answer
message("Power at alpha = 0.05 with n = ", 37, ": ", powerN(37) * 100, "%")
## Power at alpha = 0.05 with n = 37: 73.5%
Question: How does that compare to power.prop.test with p1 = (neither+both)/(all), p2 = (CTNNB1only + DDX3Xonly)/(all), and n=37? Does this seem like an over- or under-estimate relative to Fisher's exact test (above)?
Click here for an answer
power.prop.test(n=37, p1=(35/37), p2=(2/37))
##
## Two-sample comparison of proportions power calculation
##
## n = 37
## p1 = 0.9459459
## p2 = 0.05405405
## sig.level = 0.05
## power = 1
## alternative = two.sided
##
## NOTE: n is number in *each* group
For reasons we'll see shortly, I find the above to be a gross overestimate. What about our odds ratio estimates? Do they hop around all over the place? Let's add a pseudocount to shrink the odds ratio estimates and stabilize them.
# how wild are our odds ratios at a given N?
shrinkOR <- function(n, pseudo=2) {
res <- sim2x2(n, neither, a, b, both) + pseudo
odds <- (res[1,1] * res[2,2]) / (res[1,2] * res[2,1])
return(odds)
}
OR0 <- function(n) replicate(1000, shrinkOR(n, pseudo=1e-6))
for (N in c(10, 20, 40, 80)) {
hist(OR0(n=N), xlab="Estimate", main=paste("Near-raw odds ratio distribution with N =", N))
}
# And if we shrink a bit (i.e., apply a prior)?
ORs <- function(n) replicate(1000, shrinkOR(n))
for (N in c(10, 20, 40, 80)) {
hist(ORs(n=N), xlab="Estimate", main=paste("Shrunken odds ratio distribution with N =", N))
}
(Some of you may already know the log-odds ratio trick to turn the above into a bell curve. If not, try re-plotting the histograms above, but using the log of the OR estimates.)
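A quick sketch of that re-plot, assuming only the ORs() helper defined above, might look like:
# re-plot the shrunken odds ratio distributions on the log scale,
# where the sampling distribution looks much more bell-shaped
for (N in c(10, 20, 40, 80)) {
  hist(log(ORs(n=N)), xlab="log(OR) estimate",
       main=paste("Log shrunken odds ratio distribution with N =", N))
}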
The tails of the estimates are pretty long, but the mass concentrates quickly near the true parameter estimate. (This is also why resampling approaches can help stabilize estimates: if you have enough data to estimate a parameter, resampling can also estimate how fragile your estimates are, and therefore how trustworthy they are. That's why we bootstrap.)
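To make that concrete, here is a minimal sketch of a nonparametric bootstrap of the (pseudocount-stabilized) CTNNB1/DDX3X odds ratio from the PCGP mutation matrix; bootOR is a hypothetical helper and assumes muts is still in the workspace:
# bootstrap the CTNNB1/DDX3X odds ratio by resampling subjects with replacement;
# a pseudocount keeps the estimate finite when a resampled table has empty cells
bootOR <- function(data, pseudo=1) {
  resampled <- data[sample(nrow(data), replace=TRUE), ]
  tab <- table(factor(resampled$CTNNB1, levels=0:1),
               factor(resampled$DDX3X, levels=0:1)) + pseudo
  (tab[1,1] * tab[2,2]) / (tab[1,2] * tab[2,1])
}
boot_ORs <- replicate(1000, bootOR(muts))
quantile(boot_ORs, c(0.025, 0.5, 0.975))  # rough percentile interval for the odds ratio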
- Run with(muts, table(CTNNB1, DDX3X)) %>% fisher.test(). What does the 95% CI represent?
- Fisher's exact test is a special case of hypergeometric testing. What others exist? What are these tests typically used for, and what assumptions are made regarding categories?
- What happens if you test multiple hypotheses, or apply multiple tests of the same hypothesis?
- Does it matter which correction method in p.adjust you use to correct if you do?
- Is there a situation in cancer genetic epidemiology where the FWER would make more sense than the FDR (i.e., where you want to bound the probability of any false positives)?
- Can you rig up a simulation where the cost for a false negative is much greater than for a false positive, and tally up the results of various p.adjust methods in this situation?
- Can you rig up a simulation where the cost for a false positive is much greater than for a false negative, and tally up the results of various p.adjust schemes for that?
- (Nontrivial) How would you adjust this for a 2x3 or 3x2 table like in Wang et al. 2014?
- (Nontrivial) How does the addition of a pseudocount stabilize the variance of estimates? What is being traded away when we do this, and is it (typically) a worthwhile trade?