GitHub - sonali-bioc/UncertaintyRNA: This github repository contains code to reproduce the analysis in our paper "Variability in estimated gene expression among commonly used RNA-seq pipelines"

title: Uncertainty in RNA
author: Sonali Arora, Hamid Bolouri
date: December 7, 2018
output: html_document (toc: true, theme: united)

Introduction

This GitHub repository contains code to reproduce the analysis in our paper "Variability in estimated gene expression among commonly used RNA-seq pipelines". The paper is now published in Scientific Reports.

Additional Figures

This repository includes a large number of additional supplementary figures that are not present in the online version of the paper.

Additional Fig 1: Examples of a few discordant genes in TCGA
Additional Fig 2: Examples of a few discordant genes in GTEx
Additional Fig 3: TCGA batch effects
Additional Fig 4: GTEx batch effects

Downloading Data

Data folder Organization

To run our code, you need to download data from two different sources:

a) All source data and processed SummarizedExperiment objects, from the Amazon S3 bucket.

b) The vignettes, from this GitHub repository.

Both the folder from the Amazon S3 bucket (i.e. OriginalTCGAGTExData) and the folder containing the vignettes (the git repository) should be saved in the same parent folder. For example, one could save both folders under "Downloads", as shown below:


# folder where S3BUCKET data and github directory are stored. eg: ~/Downloads

# github directory eg: ~/Downloads/UncertaintyRNA

# S3 bucket directory eg: ~/Downloads/OriginalTCGAGTExData

# when you run our RMD files, a new subfolder called "data" will be created.
# This will essentially remake the "data" subfolder from the github repository.
# eg: ~/Downloads/data
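
The layout described above can be checked before running the vignettes. The following is a minimal sketch, assuming the parent folder is ~/Downloads as in the example; the function name check_layout is hypothetical and not part of this repository.

```shell
# Hypothetical helper: verify that the git repository and the S3 bucket
# folder sit side by side under the same parent folder.
# The default parent (~/Downloads) is the example location used above.
check_layout() {
    local parent="${1:-$HOME/Downloads}"
    local missing=0
    local dir
    for dir in UncertaintyRNA OriginalTCGAGTExData; do
        if [ -d "$parent/$dir" ]; then
            echo "found:   $parent/$dir"
        else
            echo "missing: $parent/$dir"
            missing=1
        fi
    done
    # Exit status is 0 only when both folders are present.
    return "$missing"
}
```

Calling `check_layout ~/Downloads` prints one line per expected folder and returns a non-zero status if either is absent.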

Amazon S3 Bucket Data

Download Processed Data

If you only want the final SE objects needed to recreate the figures in our paper, the command below creates a folder called "OriginalTCGAGTExData" containing a single sub-folder, "SE_objects", and downloads that sub-folder's contents into it.

wget --recursive -nH --cut-dirs=1 https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/SE_objects/index.html

Download complete data in chunks

If you would like to download all the data associated with this paper, we recommend downloading it in chunks using the following commands:

wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/annotations/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_XENA/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_GDC/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_GTEX_v6/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_MSKCC/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_PICCOLO/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_RECOUNT2_GTEX/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/datasource_RECOUNT2_TCGA/index.html

wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/combined_SEobjects/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/SE_objects/index.html
wget --recursive -nH --cut-dirs=1  https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData/raw_counts/index.html
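
The chunked download can also be expressed as a loop. This is only a sketch: the base URL and sub-folder names are copied from the commands above, while the names BASE_URL, CHUNKS, DRY_RUN and download_chunks are hypothetical. By default it only prints the commands, so they can be inspected before anything is downloaded.

```shell
# Base URL and sub-folder names copied from the wget commands above.
BASE_URL="https://fh-pi-holland-e-eco-public.s3.us-west-2.amazonaws.com/OriginalTCGAGTExData"
CHUNKS="annotations datasource_XENA datasource_GDC datasource_GTEX_v6 datasource_MSKCC datasource_PICCOLO datasource_RECOUNT2_GTEX datasource_RECOUNT2_TCGA combined_SEobjects SE_objects raw_counts"

# Hypothetical helper: handle one chunk at a time.
# With DRY_RUN=1 (the default) the wget commands are printed, not run.
# With DRY_RUN=0 each chunk is downloaded and failures are reported on stderr.
download_chunks() {
    local chunk
    for chunk in $CHUNKS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "wget --recursive -nH --cut-dirs=1 $BASE_URL/$chunk/index.html"
        else
            wget --recursive -nH --cut-dirs=1 "$BASE_URL/$chunk/index.html" \
                || echo "download failed for $chunk" >&2
        fi
    done
}
```

Run `download_chunks` to review the commands, then `DRY_RUN=0 download_chunks` to start the downloads. A failed chunk is reported and the loop continues, so it can be retried on its own later.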

Complete Data Download

WARNING: Please note that downloading all the data will take a long time.

wget --recursive -nH --cut-dirs=1 https://s3-us-west-2.amazonaws.com/fh-pi-holland-e/OriginalTCGAGTExData/index.html

The above line will create a folder called "OriginalTCGAGTExData" and the following sub-folders:

annotations
datasource_GDC
datasource_GTEX_v6
datasource_MSKCC
datasource_PICCOLO
datasource_RECOUNT2_GTEX
datasource_RECOUNT2_TCGA
datasource_XENA
combined_SEobjects
SE_objects

Clone this github repository

You can clone this GitHub repository with:

git clone https://github.com/sonali-bioc/UncertaintyRNA.git

MD5SUM for downloaded files

The md5sums for all files downloaded from the S3 bucket have been placed here.
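
Once the checksum list has been downloaded, the files can be verified in a single pass. The sketch below assumes the list is in the format written by GNU md5sum (one "hash, two spaces, relative path" entry per line) and that the paths are relative to the data folder. The function name verify_md5 and the file name md5sums.txt in the usage line are hypothetical.

```shell
# Hypothetical helper: verify downloaded files against a checksum list.
# $1: checksum list in GNU md5sum format ("<hash>  <relative path>")
# $2: folder the relative paths refer to (default: current folder)
verify_md5() {
    local checksum_file="$1"
    local data_dir="${2:-.}"
    # The list is passed on stdin so that its own path is resolved
    # before changing into the data folder.
    ( cd "$data_dir" && md5sum -c ) < "$checksum_file"
}
```

For example, `verify_md5 md5sums.txt OriginalTCGAGTExData` prints one status line per file and returns a non-zero status if any file is missing or does not match.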

Vignette Overview

The steps below provide a roadmap for the analysis done in the paper:

1) Acquiring TCGA data

In this vignette we show in detail how data was downloaded from each source of TCGA Data. For easier manipulation of this large data set, we convert the large text files to SummarizedExperiment objects.

2) Acquiring GTEx data

In this vignette we show in detail how data was downloaded from each source of GTEx Data. For easier manipulation of this large data set, we convert the large text files to SummarizedExperiment objects.

3) Creating TPM Normalized SE objects for TCGA data

In this vignette, we first find the common genes and common samples present in each source of TCGA data. Next, we convert RPKM normalized data to TPM normalized data.
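
The RPKM-to-TPM step can be written compactly. The relation below is the standard conversion, given here as a sketch rather than a copy of the vignette's code; it assumes the rescaling is done per sample over the genes retained in that sample. The same relation applies to the GTEx vignette.

```latex
\mathrm{TPM}_{g,s} \;=\; \frac{\mathrm{RPKM}_{g,s}}{\sum_{g'} \mathrm{RPKM}_{g',s}} \times 10^{6}
```

Here $g$ indexes genes and $s$ indexes samples, so the TPM values within each sample sum to $10^{6}$.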

4) Creating TPM Normalized SE objects for GTEx data

In this vignette, we first find the common genes and common samples present in each source of GTEx data. Next, we convert RPKM normalized data to TPM normalized data.

5) PCA using RPKM normalized data

In this vignette, we take RPKM normalized data from all sources of TCGA and GTEx data and compute Principal Components to see how similar/dissimilar these data sources are. The results from PCA analysis are stored as text files, which can be used later on for plotting in multi-panel figures.

6) PCA using TPM normalized data

In this vignette, we use TPM normalized data from all sources of TCGA and GTEx data and compute Principal Components to see how similar/dissimilar these data sources are. The results from PCA analysis are stored as text files, which can be used later on for plotting in multi-panel figures.

7) Discordant Genes

In this vignette, we:

calculate discordant genes across various TCGA sources
calculate discordant genes across various GTEx sources
calculate discordant samples across various TCGA sources
calculate discordant samples across various GTEx sources
compare the discordant genes to disease-related genes
compare the discordant genes to multi-mapped reads as reported by Robert et al.

7(b) Differences in absolute log2 fold change of discordant samples within a data source

In this vignette, we show the detailed calculation for Fig 2b of our paper.

7(c) Xena/Toil data exploration

The authors of Xena/Toil have made both the log2(TPM+0.001) and log2(FPKM+0.001) values available. In this vignette, we explore the two datasets from Xena/Toil and explain why we use one source over the other.
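
To compare these values with untransformed data from other sources, the log transform has to be undone. The identity below follows directly from the log2(TPM+0.001) definition quoted above; the symbol $y$ is introduced here only for illustration, and the same form applies to the FPKM values.

```latex
y_{g,s} = \log_{2}\!\left(\mathrm{TPM}_{g,s} + 0.001\right)
\quad\Longrightarrow\quad
\mathrm{TPM}_{g,s} = 2^{\,y_{g,s}} - 0.001
```

A gene with zero expression therefore appears as $y = \log_{2}(0.001) \approx -9.97$ in the transformed data.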

8) Supplemental Tables

In this vignette, we calculate various Supplemental Tables for our paper. These tables are also subsequently used in our analysis. They include

mRNA correlations across various TCGA sources
mRNA correlations across various GTEx sources
Protein-mRNA correlations across various TCGA sources

9) Supplemental Figures

In this vignette, we make various supplemental figures for our paper.

10) Batches in TCGA Data

In this vignette, we make various PCA plots for each type of cancer using the following batch variables: TSS, PlateID and sequencing center, for the various sources of TCGA data.

11) Batches in GTEx Data

In this vignette, we make various PCA plots using "Nucleic Acid" and "Genotype" Batches for all sources of GTEx data.

12) Combining GTEx and TCGA Data

In this vignette, we follow the approach shown in Wang et al., taking three example regions from GTEx ("Thyroid", "Stomach" and "Liver") and their corresponding cancer types ("THCA", "STAD" and "LIHC"), and make PCA plots for each data source to see how similar or dissimilar the TCGA and GTEx data are across data sources.

13) Figure 1 of submitted paper

In this vignette, we reproduce Figure 1 of our paper.

14) Figure 2 of submitted paper

In this vignette, we reproduce Figure 2 of our paper.

References

Grossman, Robert L., Heath, Allison P., et al. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine.
Vivian J, Rao AA, Nothaft FA, et al. (2017) Toil enables reproducible, open source, big biomedical data analyses. Nature biotechnology.
Collado-Torres L, Nellore A, et al. (2017) Reproducible RNA-seq analysis using recount2. Nature biotechnology.
Q. Wang, J Armenia, C. Zhang, A.V. Penson, E. Reznik, L. Zhang, T. Minet, A. Ochoa, B.E. Gross, C. A. Iacobuzio-Donahue, D. Betel, B.S. Taylor, J. Gao, N. Schultz. Unifying cancer and normal RNA sequencing data from different sources. Scientific Data 5:180061, 2018.
Rahman M, et al. (2015) Alternative preprocessing of RNA-Sequencing data in TCGA leads to improved analysis results. Bioinformatics.
The GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. (2013) Nature genetics.
Robert, C. & Watson, M. Errors in RNA-Seq quantification affect genes of relevance to human disease. Genome Biol 16, 177 (2015)
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Tools used for analysis

All our analysis is done in R. We found the following R/Bioconductor packages extremely useful in our analysis.

SummarizedExperiment for creating and storing TCGA and GTEx data as SE objects.
GenomicRanges for manipulating genomic ranges.
rtracklayer for reading in GTF files quickly as GenomicRanges objects
ggplot2 for making most of the plots in our paper.
pheatmap for making heatmaps
We also used RColorBrewer, UpSetR and eulerr

To ensure smooth execution of code in this repository, please install the following packages

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("SummarizedExperiment",
                       "GenomicRanges", 
                       "rtracklayer", 
                       "ggplot2", 
                       "pheatmap", 
                       "RColorBrewer", 
                       "UpSetR", 
                       "eulerr", 
                       "gridExtra"))

Folders and files

data
01_Acquiring_TCGA_Data_various_sources.Rmd
02_Acquiring_GTEx_Data_various_sources.Rmd
03_Creating_TPM_TCGA_Data_objects.Rmd
04_Creating_TPM_GTEx_Data_objects.Rmd
05_RPKM_normalized_Data.Rmd
06_TPM_normalized_Data.Rmd
07_Discordant_genes_samples.Rmd
07b_fc_discordant_samples.Rmd
07c_Xena_exploration.Rmd
08_Supp_Tables.Rmd
09_Supp_Figures.Rmd
10_github_tcga_batches.Rmd
11_github_gtex_batches.Rmd
12_github_combine_tcga_gtex_data.Rmd
13_Fig1_PCA_plots.Rmd
14_Fig2_Pipeline_Differences.Rmd
15_DEG_DESeq2_analysis.Rmd
16_pathways_analysis.Rmd
17_annotations.Rmd
FinalPoster.pdf
README.md
