title | author | date
---|---|---
Uncertainty in RNA | Sonali Arora, Hamid Bolouri | December 7, 2018
This GitHub repository contains code to reproduce the analysis in our paper "Variability in estimated gene expression among commonly used RNA-seq pipelines". The paper is now published in Scientific Reports.
This repository also includes a large number of additional supplementary figures that are not present in the online version of the paper:
- Additional Fig 1: Examples of a few discordant genes in TCGA
- Additional Fig 2: Examples of a few discordant genes in GTEx
- Additional Fig 3: TCGA batch effects
- Additional Fig 4: GTEx batch effects
To run our code, you need to download material from two different sources:
a) All source data and processed SummarizedExperiment objects from the Amazon S3 bucket.
b) The vignettes from this GitHub repository.
Both the folder from the Amazon S3 bucket (i.e. OriginalTCGAGTExData) and the folder containing the vignettes (the git repository) should be saved in the same parent folder. For example, one could save both folders under "Downloads", as shown below:
# parent folder where the S3 bucket data and the GitHub directory are stored, e.g. ~/Downloads
# GitHub directory, e.g. ~/Downloads/UncertaintyRNA
# S3 bucket directory, e.g. ~/Downloads/OriginalTCGAGTExData
# when you run our Rmd files, a new subfolder called "data" will be created
# This will essentially remake the "data" subfolder from the GitHub repository.
# e.g. ~/Downloads/data
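The layout described in the comments above can be checked with a few lines of shell. This is a minimal sketch, assuming a POSIX shell; `check_layout` is a hypothetical helper name and is not part of the repository.

```shell
# Hypothetical helper: report whether the git repository and the S3 bucket
# folder sit side by side under the given parent folder.
check_layout() {
    parent="$1"
    status=0
    for dir in UncertaintyRNA OriginalTCGAGTExData; do
        if [ -d "$parent/$dir" ]; then
            echo "found:   $parent/$dir"
        else
            echo "missing: $parent/$dir"
            status=1
        fi
    done
    # Non-zero exit status if either folder is missing.
    return "$status"
}
```

For the example above, one would call `check_layout ~/Downloads` before running the Rmd files.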
If you only want to download the final SE objects to recreate the figures in our paper, the command below will create a folder called "OriginalTCGAGTExData" containing a single sub-folder, "SE_objects", and download its contents:
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/SE_objects/index.html
If you would like to download all the data associated with this paper, we recommend downloading it in chunks using the following commands:
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/annotations/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_XENA/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_GDC/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_GTEX_v6/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_MSKCC/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_PICCOLO/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_RECOUNT2_GTEX/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_RECOUNT2_TCGA/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/combined_SEobjects/index.html
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/SE_objects/index.
wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/raw_counts/index.html
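The chunked download can also be scripted as a loop. The sketch below is a hedged convenience wrapper, not part of the repository: the base URL and sub-folder names are copied from the commands above, while `download_chunks` and the `DRY_RUN` variable are hypothetical names introduced here. By default it only prints the commands; set `DRY_RUN=0` to actually run `wget`.

```shell
# Base URL and sub-folder names are taken from the wget commands listed above.
BASE_URL="https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData"
CHUNKS="annotations datasource_XENA datasource_GDC datasource_GTEX_v6 datasource_MSKCC datasource_PICCOLO datasource_RECOUNT2_GTEX datasource_RECOUNT2_TCGA combined_SEobjects SE_objects raw_counts"

# Hypothetical helper: print one wget command per chunk (default),
# or run the downloads when DRY_RUN=0.
download_chunks() {
    for chunk in $CHUNKS; do
        url="$BASE_URL/$chunk/index.html"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "wget --recursive -nH --cut-dirs=1 $url"
        else
            # Keep going if one chunk fails, but report it on stderr.
            wget --recursive -nH --cut-dirs=1 "$url" || echo "failed: $chunk" >&2
        fi
    done
}
```

Running `download_chunks` first in dry-run mode makes it easy to confirm the URLs before starting a long transfer; a failed chunk can then be retried on its own.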
WARNING: Downloading all the data at once with the command below will take a long time.
wget --recursive -nH --cut-dirs=1 https://s3-us-west-2.amazonaws.com/fh-pi-holland-e/OriginalTCGAGTExData/index.html
The above line will create a folder called "OriginalTCGAGTExData" and the following sub-folders:
- annotations
- datasource_GDC
- datasource_GTEX_v6
- datasource_MSKCC
- datasource_PICCOLO
- datasource_RECOUNT2_GTEX
- datasource_RECOUNT2_TCGA
- datasource_XENA
- combined_SEobjects
- SE_objects
You can clone this GitHub repository with:
git clone https://github.com/sonali-bioc/UncertaintyRNA.git
The md5sums for all files downloaded from the S3 bucket have been placed here.
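One way to use such checksums is sketched below. It assumes GNU coreutils `md5sum` and a checksum list in the usual `md5sum` output format (one `<hash>  <relative path>` pair per line); the format and file name of the list provided with this repository may differ, and `verify_md5` is a hypothetical helper name.

```shell
# Hypothetical helper: verify downloaded files against a checksum list.
# $1 = folder holding the downloaded data
# $2 = absolute path to a checksum list in md5sum format (assumed format)
verify_md5() {
    data_dir="$1"
    checksum_file="$2"
    # Run in a subshell so the caller's working directory is unchanged.
    # --quiet prints only the files that fail verification.
    ( cd "$data_dir" && md5sum --check --quiet "$checksum_file" )
}
```

The function returns a non-zero exit status if any file is missing or does not match, so it can gate later steps in a script.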
The steps below provide a roadmap for the analysis done in the paper:
In this vignette, we show in detail how data was downloaded from each source of TCGA data. For easier manipulation of this large data set, we convert the large text files to SummarizedExperiment objects.
In this vignette, we show in detail how data was downloaded from each source of GTEx data. For easier manipulation of this large data set, we convert the large text files to SummarizedExperiment objects.
In this vignette, we first find the common genes and common samples present in each source of TCGA data. Next, we convert RPKM-normalized data to TPM-normalized data.
In this vignette, we first find the common genes and common samples present in each source of GTEx data. Next, we convert RPKM-normalized data to TPM-normalized data.
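The RPKM-to-TPM conversion mentioned in the two vignettes above is commonly written as follows. This is the standard formula, given as a sketch; the vignettes hold the authors' exact implementation. The symbols $g$ (gene), $s$ (sample) and $G$ (the set of genes in the matrix being converted) are notation introduced here.

$$
\mathrm{TPM}_{g,s} = \frac{\mathrm{RPKM}_{g,s}}{\sum_{g' \in G} \mathrm{RPKM}_{g',s}} \times 10^{6}
$$

By construction, the TPM values of each sample sum to $10^{6}$ over the genes in $G$.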
In this vignette, we take RPKM-normalized data from all sources of TCGA and GTEx data and compute principal components to see how similar or dissimilar these data sources are. The results of the PCA are stored as text files, which can be used later for plotting in multi-panel figures.
In this vignette, we use TPM-normalized data from all sources of TCGA and GTEx data and compute principal components to see how similar or dissimilar these data sources are. The results of the PCA are stored as text files, which can be used later for plotting in multi-panel figures.
In this vignette, we:
- calculate discordant genes across the various TCGA sources
- calculate discordant genes across the various GTEx sources
- calculate discordant samples across the various TCGA sources
- calculate discordant samples across the various GTEx sources
- compare the discordant genes to disease-related genes
- compare the discordant genes to the multi-mapped reads reported by Robert et al.
In this vignette, we show the detailed calculation for Fig 2b of our paper.
The authors of Xena/Toil have made available both the log2(TPM+0.001) and log2(FPKM+0.001) data. In this vignette, we explore the two datasets from Xena/Toil and explain why we use one source over the other.
In this vignette, we calculate various Supplemental Tables for our paper. These tables are also used subsequently in our analysis. They include:
- mRNA correlations across various TCGA sources
- mRNA correlations across various GTEx sources
- Protein-mRNA correlations across various TCGA sources
In this vignette, we make various supplemental figures for our paper.
In this vignette, we make PCA plots for each type of cancer using the following batch variables: TSS, PlateID and sequencing center, for the various sources of TCGA data.
In this vignette, we make PCA plots using the "Nucleic Acid" and "Genotype" batches for all sources of GTEx data.
In this vignette, we follow the approach shown in Wang et al.: we take three example regions from GTEx (thyroid, stomach and liver) and their corresponding cancer types ("THCA", "STAD" and "LIHC"), and make PCA plots for each data source to see how similar or dissimilar the TCGA and GTEx data are.
In this vignette, we reproduce Figure 1 of our paper.
In this vignette, we reproduce Figure 2 of our paper.
- Grossman, Robert L., Heath, Allison P., et al. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine.
- Vivian J, Rao AA, Nothaft FA, et al. (2017) Toil enables reproducible, open source, big biomedical data analyses. Nature biotechnology.
- Collado-Torres L, Nellore A, et al. (2017) Reproducible RNA-seq analysis using recount2. Nature biotechnology.
- Q. Wang, J Armenia, C. Zhang, A.V. Penson, E. Reznik, L. Zhang, T. Minet, A. Ochoa, B.E. Gross, C. A. Iacobuzio-Donahue, D. Betel, B.S. Taylor, J. Gao, N. Schultz. Unifying cancer and normal RNA sequencing data from different sources. Scientific Data 5:180061, 2018.
- Rahman M, et al. (2015) Alternative preprocessing of RNA-Sequencing data in TCGA leads to improved analysis results. Bioinformatics.
- The GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. (2013) Nature genetics.
- Robert, C. & Watson, M. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol 16, 177 (2015)
- R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
All our analysis is done in R. We found the following R/Bioconductor packages extremely useful in our analysis:
- SummarizedExperiment for creating and storing TCGA and GTEx data as SE objects.
- GenomicRanges for manipulating genomic ranges.
- rtracklayer for reading in GTF files quickly as GenomicRanges objects
- ggplot2 for making most of the plots in our paper.
- pheatmap for making heatmaps
- We also used RColorBrewer, UpSetR and eulerr
To ensure smooth execution of the code in this repository, please install the following packages:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("SummarizedExperiment",
                       "GenomicRanges",
                       "rtracklayer",
                       "ggplot2",
                       "pheatmap",
                       "RColorBrewer",
                       "UpSetR",
                       "eulerr",
                       "gridExtra"))