This project tries to reproduce the results mentioned in the paper "Ovarian Carcinoma-Associated Mesenchymal Stem Cells Arise from Tissue-Specific Normal Stroma" with the data provided by this paper.
To obtain gene-level count matrices, two different methods were used:
- The first was the method described in the paper: align the FASTQ files using Bowtie2, then obtain gene-level count matrices using htseq-count.
- The second used Salmon, a relatively new tool. Salmon generates transcript-level counts, which can be rolled up to gene-level counts using tximport, an R package.
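The roll-up from transcript-level to gene-level counts can be illustrated with a minimal Python sketch. This is not tximport itself, which is an R package and also handles details such as transcript-length information that are ignored here. The function name, the transcript and gene IDs, and the `tx2gene` mapping below are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def rollup_to_gene_level(transcript_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts.

    transcript_counts: dict mapping transcript ID -> estimated count
    tx2gene: dict mapping transcript ID -> gene ID (hypothetical mapping)
    Transcripts that are missing from tx2gene are skipped.
    """
    gene_counts = defaultdict(float)
    for tx_id, count in transcript_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            continue  # no gene annotation for this transcript
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Toy data: every ID here is made up for the example.
transcript_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 3.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = rollup_to_gene_level(transcript_counts, tx2gene)
print(gene_counts)
```

In this toy input, `tx1` and `tx2` both map to `geneA`, so their counts are added together, while `tx4` has no mapping and is dropped.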
Compared with the Bowtie2-htseq pipeline, Salmon is much faster. The Bowtie2-htseq pipeline was started on the afternoon of July 7 and is still running at the time of writing; it is expected to finish on July 11. Salmon, by contrast, took only around three hours to produce all the matrices. Salmon found fewer genes in total than Bowtie2-htseq (around 36,000 versus around 54,000), but it is a reliable tool: Illumina DRAGEN Secondary Analysis adopted Salmon as part of its RNA-Seq pipeline.
To run fgsea, two differential gene expression (DGE) analysis tools were used: limma and DESeq2. It is unclear why the project team chose limma as the DGE tool, since limma was originally designed for microarray data, whereas DESeq2 is one of the most widely used DGE tools for RNA-Seq data. The paper reports the 27 most enriched genes found using fgsea, based on their enrichment scores (ES). However, fgsea assigns enrichment scores only to gene sets, not to individual genes, so how the 27 genes were selected remains an open question.
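To make the point about set-level scores concrete, here is a minimal Python sketch of the classic GSEA-style running-sum enrichment score. It is a simplified illustration, not fgsea's implementation, and it does not compute p-values or normalized scores. The function name, the gene names, and the statistics are hypothetical. Note that the function takes a whole gene set as input and returns one number for that set; no per-gene score is produced.

```python
def enrichment_score(ranked_genes, gene_set, weight=1.0):
    """Return a running-sum enrichment score for one gene set.

    ranked_genes: list of (gene, statistic) pairs sorted by statistic,
                  highest first (for example, a DGE test statistic)
    gene_set: set of gene names
    weight: exponent applied to |statistic| for genes in the set
    """
    hit_weights = [
        abs(stat) ** weight if gene in gene_set else 0.0
        for gene, stat in ranked_genes
    ]
    total_hit_weight = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked_genes if gene not in gene_set)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it")

    running = 0.0
    best = 0.0  # largest deviation from zero seen so far
    for (gene, _), hit in zip(ranked_genes, hit_weights):
        if gene in gene_set:
            running += hit / total_hit_weight  # step up on a hit
        else:
            running -= 1.0 / n_miss  # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: gene names and statistics are made up.
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]

es_top = enrichment_score(ranked, {"g1", "g2"})     # set at the top of the ranking
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set at the bottom of the ranking
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gets a positive score and a set concentrated at the bottom gets a negative one. Selecting individual genes would need an additional step that is not part of this score.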
GSEA, developed by the Broad Institute, was also used for gene set enrichment analysis.
For both GSEA and fgsea, three MSigDB gene set collections were used:
- h.all.v2023.1.Hs.symbols.gmt (Hallmark)
- c2.cp.kegg.v2023.1.Hs.symbols.gmt (KEGG)
- c5.all.v2023.1.Hs.symbols.gmt (GO)
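The .gmt files above are tab-separated text: each line holds a gene set name, a description field, and then the genes in the set. As a rough sketch of how such a file could be read in Python (the function name and the toy lines are hypothetical, and this is not how fgsea or GSEA load the files internally):

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict: set name -> list of genes.

    Each line is expected to be: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Blank lines and lines with fewer than three fields are skipped.
    """
    gene_sets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Toy input: set names and genes are made up for the example.
example_lines = [
    "TOY_SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3\n",
    "TOY_SET_B\tna\tGENE2\tGENE4\n",
]
gene_sets = parse_gmt(example_lines)
for name, genes in gene_sets.items():
    print(name, len(genes))
```

Because the function accepts any iterable of lines, an open file handle for one of the .gmt files can be passed in place of `example_lines`.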
    .
    ├── pipeline
    │   ├── bowtie2
    │   └── tophat2
    ├── raw_data
    │   ├── SRR7702228
    │   ├── SRR7702229
    │   ├── SRR7702230
    │   ├── SRR7702231
    │   ├── SRR7702232
    │   ├── SRR7702233
    │   ├── SRR7702234
    │   ├── SRR7702235
    │   ├── SRR7702236
    │   ├── SRR7702237
    │   ├── SRR7702238
    │   ├── SRR7702239
    │   ├── SRR7702240
    │   ├── SRR7702241
    │   ├── SRR7702242
    │   ├── SRR7702243
    │   ├── SRR7702244
    │   ├── SRR7702245
    │   ├── SRR7702246
    │   ├── SRR7702247
    │   ├── SRR7702248
    │   ├── SRR7702249
    │   ├── SRR7702250
    │   ├── SRR7702251
    │   ├── SRR7702252
    │   ├── SRR7702253
    │   ├── SRR7702254
    │   ├── SRR7702255
    │   ├── SRR7702256
    │   ├── SRR7702257
    │   ├── SRR7702258
    │   ├── SRR7702259
    │   ├── SRR7702260
    │   ├── SRR7702261
    │   ├── SRR7702262
    │   ├── SRR7702263
    │   ├── SRR7702264
    │   └── SRR7702265
    ├── reference_data
    │   ├── bowtie2Index
    │   ├── fgsea
    │   ├── gtf
    │   ├── ref
    │   ├── salmonIndex_genome
    │   └── salmonIndex_transcript
    ├── results
    │   ├── fastqc
    │   ├── salmon
    │   ├── salmon_counts_reports
    │   └── tophat2
    └── scripts
On the first level of the tree, there are five folders:
- pipeline: contains the software required for FASTQ alignment
- raw_data: contains the FASTQ files for each sample
- reference_data: contains the reference files that the software needs in order to run
- results: contains the results from each software tool or pipeline, in the following sub-directories:
  - fastqc: contains the FastQC reports for each FASTQ file
  - salmon: contains the gene count matrices generated by Salmon
  - salmon_counts_reports: contains the reports generated from the Salmon gene count matrices. These reports include:
    - fgsea: gene set enrichment reports generated using the R package fgsea
    - General: EDA report and differential gene expression (DGE) report generated using R code developed by Li
    - GSEA: gene set enrichment reports generated by the original GSEA software
  - tophat2: contains the gene counts generated by the tophat2-htseq pipeline
  - tophat2_counts_reports: intended to have the same structure as salmon_counts_reports. Because the tophat2-htseq pipeline is slow, these results may not be generated in time, so this folder may not exist in the tree.
- scripts: contains all the R, Python, and bash scripts for downloading and analyzing the data
However, not all of these folders can be found in the GitHub repository, because of the 100 MB limit on uploaded files. For example, the reference_data directory was not uploaded.
The following reference files were used:
- Homo_sapiens.GRCh37.dna.primary_assembly.fa
- Homo_sapiens.GRCh37.cdna.all.fa
- Homo_sapiens.GRCh37.87.gtf
The following software versions were used for this project:
- Bowtie2-2.2.3
- TopHat2-2.0.13
- htseq-count-2.0.3
- salmon-1.4.0
- fgsea-1.26.0
- GSEA-4.3.2
- limma-3.56.2
- DESeq2-1.40.2
- R-4.3.1