The main steps described during the practice are reported below and can be easily copy/pasted in your terminal.
Note. Assuming you're traineeX, please change X according to your workspace.
Note2. Choose a GTEX sample from Cerebellum or Lung and copy DNAseq/RNAseq accordingly in your home folder.
IMPORTANT! REDItoolDnaRna.py outTable (eg. outTable_892028847) contains 9digit random number, so it usually varies among users and different script launches on the same machine.
Due to multiple available versions of the core module pysam, it is possibile that some commands will return you a pysam error.
In those cases just type:
$ conda activate rnaediting2*rnaediting2 environment contains pysam=0.15.2
Type again the command that returned errors and revert to your main environment with:
$ conda activate rnaediting*rnaediting envirnment contains pysam==0.7.7
Sample | Tissue | Gender | Path |
---|---|---|---|
SRR1083076.bam | Artery - Aorta | male normal | /data/artery/SRR1083076/SRR1083076.bam/ |
SRR1091254.bam | Artery - Aorta | male normal | /data/artery/SRR1091254/SRR1091254.bam |
SRR1368668.bam | Artery - Aorta | male normal | /data/artery/SRR1368668/SRR1368668.bam |
SRR1086680.bam | Brain - Cerebellum | male normal | /data/brain/SRR1086680/SRR1086680.bam |
SRR1311771.bam | Brain - Cerebellum | male normal | /data/brain/SRR1311771/SRR1311771.bam |
1) Create a folder for your RNAseq data (eg. RNAseq) $ mkdir RNAseq 2) Copy the RNAseq data inside it $ cp ../data/rnaediting/lung/RNAseq_fq/SRR1310520_r* . 3) Align RNASeq reads to the reference genome with STAR: $ STAR --runThreadN 4 --genomeDir /usr/share/course_data/STAR_genome_index_ucsc/ --genomeLoad NoSharedMemory --readFilesIn ./SRR1310520_r1.fastq.gz ./SRR1310520_r2.fastq.gz --readFilesCommand zcat --outFileNamePrefix SRR1310520_chr21_ --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outSAMattributes All --outFilterType BySJout --outFilterMultimapNmax 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 $cd .. 4) Create a folder for your DNAseq data (eg. DNAseq) $ mkdir DNAseq 5) Copy the DNAseq data inside it $ cp ../data/rnaediting/DNAseq/Lung* . 6) Detect all potential DNA–RNA variants in your data (limited to chromosome 21) using the REDItoolDnaRNA.py script: $ REDItoolDnaRna.py -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -j ./DNAseq/Lung_sorted.bam -o editing -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c1,1 -m30,255 -v1 -q30,30 -e -n0.0 -N0.0 -u -l -p -s2 -g2 -S -Y chr21:1-48129895 For detailed REDItoolDnaRna.py options click here 7) Exclude invariant positions as well as positions not supported by ≥10 WGS reads: $ awk 'FS="\t" {if ($8!="-" && $10>=10 && $13=="-") print}' editing/DnaRna_892028847/outTable_892028847 > outTable_892028847_chr21.out 8) Annotate positions using RepeatMasker and dbSNP annotations: $ AnnotateTable.py -a /usr/share/course_data/rnaediting/rptmsk/rmsk_chr21.sorted.gtf.gz -n rmsk -i outTable_892028847_chr21.out -o outTable_892028847_chr21.out.rmsk -u $ AnnotateTable.py -a /usr/share/course_data/rnaediting/dbsnp/snp151_chr21.sorted.gtf.gz -n snp151 -n snp151 -i outTable_892028847_chr21.out.rmsk -o outTable_892028847_chr21.out.rmsk.snp -u For detailed AnnotateTable.py options click here 9) Create a first set of positions selecting sites supported by at least five RNAseq reads and a single mismatch: $ selectPositions.py -i outTable_892028847_chr21.out.rmsk.snp -c 5 -v 1 -f 0.0 -o outTable_892028847_chr21.out.rmsk.snp.sel1 For detailed selectPositions.py options click here 10) Create a second set of positions selecting sites supported by ≥10 RNAseq reads, three mismatches and minimum editing frequency of 0.1: $ selectPositions.py -i outTable_892028847_chr21.out.rmsk.snp -c 10 -v 3 -f 0.1 -o outTable_892028847_chr21.out.rmsk.snp.sel2 11) Select ALU sites from the first set of positions: $ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)=="Alu" && $17=="-" && $8!="-" && $10>=10 && $13=="-") print}' outTable_892028847_chr21.out.rmsk.snp.sel1 > outTable_892028847_chr21.out.rmsk.snp.alu 12) Select REP NON ALU sites from the second set of positions, excluding sites in Simple repeats or Low complexity regions: $ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)!="Alu" && $15!="-" && $15!="Simple_repeat" && $15!="Low_complexity" && $17=="-" && $8!="-" && $10>=10 && $14<=0.05 && $9>=0.1) print}' outTable_892028847_chr21.out.rmsk.snp.sel2 > outTable_892028847_chr21.out.rmsk.snp.nonalu 13) Select NON REP sites from the second set of positions: $ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)!="Alu" && $15=="-" && $17=="-" && $8!="-" && $10>=10 && $14<=0.05 && $9>=0.1) print}' outTable_892028847_chr21.out.rmsk.snp.sel2 > outTable_892028847_chr21.out.rmsk.snp.nonrep 14) Annotate ALU, REP NON ALU and NON REP sites using known editing events from REDIportal: $ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R -c 1 -i outTable_892028847_chr21.out.rmsk.snp.alu -o outTable_892028847_chr21.out.rmsk.snp.alu.ed -u $ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R -c 1 -i outTable_892028847_chr21.out.rmsk.snp.nonalu -o outTable_892028847_chr21.out.rmsk.snp.nonalu.ed -u $ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R -c 1 -i outTable_892028847_chr21.out.rmsk.snp.nonrep -o outTable_892028847_chr21.out.rmsk.snp.nonrep.ed -u 15) Extract known editing events from ALU, REP NON ALU and NON REP sites: $ mv outTable_892028847_chr21.out.rmsk.snp.alu.ed alu $ mv outTable_892028847_chr21.out.rmsk.snp.nonalu.ed nonalu $ mv outTable_892028847_chr21.out.rmsk.snp.nonrep.ed nonrep $ cat alu nonalu nonrep > alu-nonalu-nonrep $ awk 'FS="\t" {if ($19=="ed") print}' alu-nonalu-nonrep > knownEditing 16) Convert editing candidates ($19!="ed" selects novel RNA editing events.) in REP NON ALU and NON REP sites in GFF format for further filtering: $ cat nonalu nonrep > nonalu-nonrep $ awk 'FS="\t" {if ($19!="ed") print}' nonalu-nonrep > pos.txt $TableToGFF.py -i pos.txt -s -t -o pos.gff For detailed TableToGFF.py options click here 17) Convert editing candidates in ALU sites in GFF format for further filtering: $ awk 'FS="\t" {if ($19!="ed") print}' alu > posalu.txt $ TableToGFF.py -i posalu.txt -s -t -o posalu.gff 18) Launch REDItoolDnaRna.py on ALU sites using stringent criteria to recover potential editing candidates: $ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 5,5 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 1 -n 0.0 -e -T posalu.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -k /usr/share/course_data/rnaediting/hg19ref/nochr -R -o firstalu 19) Launch REDItoolDnaRna.py on REP NON ALU and NON REP sites using stringent criteria to recover RNAseq reads harboring reference mismatches: $ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 10,10 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 3 -n 0.1 -e -T pos.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -k /usr/share/course_data/rnaediting/hg19ref/nochr --reads -R --addP -o first 20) Launch pblat on RNAseq reads harboring reference mismatches from previous step and select multimapping reads: $ pblat -t=dna -q=rna -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa first/DnaRna_304977045/outReads_304977045 reads.psl $ readPsl.py reads.psl badreads.txt 21) Extract RNAseq reads harboring reference mismatches from Step 19 and remove duplicates: $ sort -k1,1 -k2,2n -k3,3n first/DnaRna_595685947/outReads_595685947 | mergeBed > bed $ samtools view -@ 4 -L bed -h -b ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam > SRR1310520_chr21_bed.bam $ samtools sort -@ 4 -n SRR1310520_chr21_bed.bam -o SRR1310520_chr21_bed_ns.bam $ samtools fixmate -@ 4 -m SRR1310520_chr21_bed_ns.bam SRR1310520_chr21_bed_ns_fx.bam $ samtools sort -@ 4 SRR1310520_chr21_bed_ns_fx.bam -o SRR1310520_chr21_bed_ns_fx_st.bam $ samtools markdup -r -@ 4 SRR1310520_chr21_bed_ns_fx_st.bam SRR1310520_chr21_bed_dedup.bam $ samtools index SRR1310520_chr21_bed_dedup.bam 22) Re-run REDItoolDnaRna.py on REP NON ALU and NON REP sites using stringent criteria, deduplicated reads and mis-mapping info: $ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i SRR1310520_chr21_bed_dedup.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 10,10 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 3 -n 0.1 -e -T pos.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -R -k /usr/share/course_data/rnaediting/hg19ref/nochr -b badreads.txt --rmIndels -o second 23) Collect filtered ALU, REP NON ALU and NON REP sites: $ collect_editing_candidates.py $ sort -k1,1 -k2,2n editing.txt > editing_sorted.txt 24) Inspect the distribution of editing candidates to look at A-to-I enrichment: $ get_Statistics.py