Skip to content

RNAseq from fastq reads to gene counts/quantification

Notifications You must be signed in to change notification settings

isugifNF/RNAseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RNAseq

RNAseq from fastq reads to gene counts/quantification

nextflow run main.nf --help
N E X T F L O W  ~  version 20.07.1
Launching `main.nf` [curious_curran] - revision: 7d9f51e73b
Usage:
   The typical command for running the RNAseq pipeline is as follows:
   nextflow run main.nf --reads "*_{R1,R2}.fastq.gz" --genome GENOME.fasta --genome_gff GENOME.gff --genome_cdna CDNA.fasta  -profile singularity

   Mandatory arguments:
    --reads                 Paired-end reads in fastq.gz format, will need to specify glob (e.g. "*_{R1,R2}.fastq.gz")
    --genome                Reference genome fasta file
    --genome_gff            Reference gff file
    --genome_cdna           Reference cdna file

   Optional configuration arguments:
    --methods               Select one or more RNAseq counting methods [default: 'gsnap,hisat2,kallisto,salmon']
    -profile                Configuration profile to use. Can use multiple (comma separated)
                            Available: local, slurm [default:local]
    --fastqc_app            Link to fastqc executable [default: 'fastqc']
    --multiqc_app           Link to multiqc executable [default: 'multiqc']
    --parallel_app          Link to parallel executable [default: 'parallel']
    --kallisto_app          Link to kallisto executable [default: 'kallisto']
    --salmon_app            Link to salmon executable [default: 'salmon']
    --gsnap_app             Link to gsnap executable [default: 'gsnap']
    --featureCounts_app     Link to featureCounts executable [default: 'featureCounts']

   Optional other arguments:
    --threads               Threads per process [default:4 for local, 16 for slurm]
    --queueSize             Maximum jobs to submit to slurm [default:20]
    --account               HPC account name for slurm sbatch, atlas and ceres requires this
    --help

Example Dataset

Fetch Maize Reads
mkdir 00_Raw-Data; cd 00_Raw-Data;
# ==== Fetch Reads
# === Blade Tissue
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/004/SRR1573504/SRR1573504_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/004/SRR1573504/SRR1573504_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/005/SRR1573505/SRR1573505_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/005/SRR1573505/SRR1573505_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/006/SRR1573506/SRR1573506_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/006/SRR1573506/SRR1573506_2.fastq.gz

# === Blade mutant L1
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/009/SRR1573519/SRR1573519_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/009/SRR1573519/SRR1573519_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/000/SRR1573520/SRR1573520_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/000/SRR1573520/SRR1573520_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/001/SRR1573521/SRR1573521_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/001/SRR1573521/SRR1573521_2.fastq.gz

# === Ligule Tissue
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/007/SRR1573507/SRR1573507_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/007/SRR1573507/SRR1573507_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/008/SRR1573508/SRR1573508_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/008/SRR1573508/SRR1573508_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/009/SRR1573509/SRR1573509_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/009/SRR1573509/SRR1573509_2.fastq.gz

# === Ligule mutant L1
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/002/SRR1573522/SRR1573522_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/002/SRR1573522/SRR1573522_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/003/SRR1573523/SRR1573523_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/003/SRR1573523/SRR1573523_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/004/SRR1573524/SRR1573524_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/004/SRR1573524/SRR1573524_2.fastq.gz

# === Sheath Tissue
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/000/SRR1573510/SRR1573510_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/000/SRR1573510/SRR1573510_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/001/SRR1573511/SRR1573511_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/001/SRR1573511/SRR1573511_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/002/SRR1573512/SRR1573512_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/002/SRR1573512/SRR1573512_2.fastq.gz

# === Sheath mutant L1
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/005/SRR1573525/SRR1573525_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/005/SRR1573525/SRR1573525_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/006/SRR1573526/SRR1573526_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/006/SRR1573526/SRR1573526_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/007/SRR1573527/SRR1573527_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR157/007/SRR1573527/SRR1573527_2.fastq.gz

Reference files

Fetch Zea mays reference files from NCBI

cd 00_Raw-Data

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/167/145/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/167/145/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.gff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/167/145/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_rna.fna.gz

Example Run

N E X T F L O W  ~  version 20.10.0
Launching `main.nf` [naughty_keller] - revision: 6370a27351
executor >  slurm (154)
[65/9f5b8d] process > fastqc (batched)               [100%] 6 of 6 ✔
[a6/f76504] process > multiqc                        [100%] 1 of 1 ✔
[3f/cb5455] process > kallisto_index (GCF_9021671... [100%] 1 of 1 ✔
[96/940514] process > kallisto_quant (SRR1573504)    [100%] 24 of 24 ✔
[2d/ab4e35] process > salmon_index (GCF_902167145... [100%] 1 of 1 ✔
[b5/260d0a] process > salmon_quant (SRR1573509)      [100%] 24 of 24 ✔
[42/d3600c] process > gsnap_index (GCF_902167145)    [100%] 1 of 1 ✔
[17/b68c59] process > gsnap_quant (SRR1573511)       [100%] 24 of 24 ✔
[51/de2e96] process > featureCounts_gene (SRR1573... [100%] 24 of 24 ✔
[5b/3c04f4] process > featureCounts_mRNA (SRR1573... [100%] 24 of 24 ✔
[7c/7ccc62] process > featureCounts_geneMult (SRR... [100%] 24 of 24 ✔
Completed at: 22-Mar-2021 04:51:51
Duration    : 4h 20m 7s
CPU hours   : 71.3
Succeeded   : 154

timeline.html | report.html

RNASeq_Results/
  |_ 01_Quality-Control/     # fastqc and multiqc report
  |_ 03_GSNAP                # gene counts from GSNAP
  |_ 03_Kallisto             # transcript counts from kallisto (ready for sleuth analysis)
  |_ 03_Salmon               # transcript counts from salmon (ready for DESeq2 analysis)
  |_ report.html             # memory usage report
  |_ timeline.html           # runtime report
See example run on Ceres HPC - last update 24 June 2021

Ran in 1 hour and 48 minutes. The below commands were wrapped in a slurm script.

# Ceres Modules
module load nextflow
module load parallel 
module load fastqc
module load python_3
module load salmon/0.10.1
module load kallisto/0.42.4
module load hisat2/2.2.0
module load gmap_gsnap/2020-04-08
module load samtools
module load subread/2.0.2

# Main command
nextflow run main.nf \
  --reads "00_Raw-Data/*{1,2}.fastq.gz" \
  --genome "00_Raw-Data/*_genomic.fna.gz" \
  --genome_gff "00_Raw-Data/*.gff.gz" \
  --genome_cdna "00_Raw-Data/*_rna.fna.gz" \
  --queueSize 25 \
  -profile slurm \
  -resume

Progress messages

N E X T F L O W  ~  version 20.07.1
Launching `main.nf` [elated_kilby] - revision: 7d9f51e73b
executor >  slurm (117)
[5e/2c59b1] process > fastqc (batched)               [100%] 5 of 5 ✔
[90/ed372f] process > multiqc                        [100%] 1 of 1 ✔
[b7/06aa08] process > kallisto_index (GCF_9021671... [100%] 1 of 1 ✔
[62/2027e7] process > kallisto_quant (SRR1573520)    [100%] 18 of 18 ✔
[3e/ebd42f] process > salmon_index (GCF_902167145... [100%] 1 of 1 ✔
[d9/03b7df] process > salmon_quant (SRR1573520)      [100%] 18 of 18 ✔
[4c/ec81ca] process > gsnap_index (GCF_902167145)    [100%] 1 of 1 ✔
[36/237247] process > gsnap_align (SRR1573520)       [100%] 18 of 18 ✔
[5f/54c2e9] process > featureCounts_gene (SRR1573... [100%] 18 of 18 ✔
[c7/f2c642] process > featureCounts_mRNA (SRR1573... [100%] 18 of 18 ✔
[db/3e6c52] process > featureCounts_geneMult (SRR... [100%] 18 of 18 ✔
Completed at: 24-Jun-2021 13:07:28
Duration    : 1h 48m 38s
CPU hours   : 11.5
Succeeded   : 117
See example run on Atlas HPC - last updated 24 June 2021

The below was wrapped in a slurm script, all dependencies did not have modules, so all installed via conda environment. Ran in 1 hour and 36 minutes.

# May need to install miniconda on Atlas (not sure why it's not there...)
[[ -d envs/rnaseq_env ]] || conda env create -f environment.yml -p ${PWD}/envs/rnaseq_env
source activate ${PWD}/envs/rnaseq_env

# Nextflow needs to be installed
[[ -f nextflow ]] || curl -s https://get.nextflow.io | bash

./nextflow run main.nf \
  --reads "00_Raw-Data/*{1,2}.fastq.gz" \
  --genome "00_Raw-Data/02_Genome/*_genomic.fna.gz" \
  --genome_gff "00_Raw-Data/02_Genome/*.gff.gz" \
  --genome_cdna "00_Raw-Data/02_Genome/*rna.fna.gz" \
  --queueSize 50 \
  -profile slurm \
  --account "isu_gif_vrsc" \
  --threads 40 \
  -resume

Progress messages

N E X T F L O W  ~  version 21.04.1
Launching `main.nf` [nice_bartik] - revision: 7d9f51e73b
executor >  slurm (117)
[26/d0a530] process > fastqc (batched)               [100%] 5 of 5 ✔
[7d/cc81d3] process > multiqc                        [100%] 1 of 1 ✔
[83/3ff604] process > kallisto_index (GCF_9021671... [100%] 1 of 1 ✔
[57/651fa9] process > kallisto_quant (SRR1573509)    [100%] 18 of 18 ✔
[7e/6d4bc2] process > salmon_index (GCF_902167145... [100%] 1 of 1 ✔
[db/348442] process > salmon_quant (SRR1573520)      [100%] 18 of 18 ✔
[20/f31fda] process > gsnap_index (GCF_902167145)    [100%] 1 of 1 ✔
[0b/fbe48e] process > gsnap_align (SRR1573520)       [100%] 18 of 18 ✔
[4c/814675] process > featureCounts_gene (SRR1573... [100%] 18 of 18 ✔
[2b/15418d] process > featureCounts_mRNA (SRR1573... [100%] 18 of 18 ✔
[6f/327866] process > featureCounts_geneMult (SRR... [100%] 18 of 18 ✔
Completed at: 24-Jun-2021 17:21:26
Duration    : 1h 36m 34s
CPU hours   : 16.6
Succeeded   : 117

Releases

No releases published

Packages

No packages published