Skip to content

Latest commit

 

History

History
188 lines (125 loc) · 10.3 KB

reditoolsdnarna.md

File metadata and controls

188 lines (125 loc) · 10.3 KB

Epitranscriptome Course Bari 28-April-2023

The main steps described during the practice are reported below and can be easily copy/pasted in your terminal.
Note. Assuming you're traineeX, please change X according to your workspace.
Note2. Choose a GTEX sample from Cerebellum or Lung and copy DNAseq/RNAseq accordingly in your home folder.
IMPORTANT! REDItoolDnaRna.py outTable (eg. outTable_892028847) contains 9digit random number, so it usually varies among users and different script launches on the same machine.
Due to multiple available versions of the core module pysam, it is possibile that some commands will return you a pysam error.
In those cases just type:

 $ conda activate rnaediting2 
*rnaediting2 environment contains pysam=0.15.2
Type again the command that returned errors and revert to your main environment with:

 $ conda activate rnaediting 
*rnaediting envirnment contains pysam==0.7.7
Sample Tissue Gender Path
SRR1083076.bam Artery - Aorta male normal /data/artery/SRR1083076/SRR1083076.bam/
SRR1091254.bam Artery - Aorta male normal /data/artery/SRR1091254/SRR1091254.bam
SRR1368668.bam Artery - Aorta male normal /data/artery/SRR1368668/SRR1368668.bam
SRR1086680.bam Brain - Cerebellum male normal /data/brain/SRR1086680/SRR1086680.bam
SRR1311771.bam Brain - Cerebellum male normal /data/brain/SRR1311771/SRR1311771.bam

1) Create a folder for your RNAseq data (eg. RNAseq)
$ mkdir RNAseq

2) Copy the RNAseq data inside it
$ cp ../data/rnaediting/lung/RNAseq_fq/SRR1310520_r* .

3) Align RNASeq reads to the reference genome with STAR:
$ STAR --runThreadN 4   --genomeDir /usr/share/course_data/STAR_genome_index_ucsc/   --genomeLoad NoSharedMemory   --readFilesIn ./SRR1310520_r1.fastq.gz   ./SRR1310520_r2.fastq.gz  --readFilesCommand zcat  --outFileNamePrefix SRR1310520_chr21_   --outReadsUnmapped Fastx   --outSAMtype BAM   SortedByCoordinate      --outSAMstrandField intronMotif   --outSAMattributes All      --outFilterType BySJout   --outFilterMultimapNmax 1  --outFilterMismatchNmax 999   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000  --alignMatesGapMax 1000000   --alignSJoverhangMin 8   --alignSJDBoverhangMin 1

$cd ..

4) Create a folder for your DNAseq data (eg. DNAseq)
$ mkdir DNAseq

5) Copy the DNAseq data inside it
$ cp ../data/rnaediting/DNAseq/Lung* .

6) Detect all potential DNA–RNA variants in your data (limited to chromosome 21) using the REDItoolDnaRNA.py script:

$ REDItoolDnaRna.py -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -j ./DNAseq/Lung_sorted.bam -o editing -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa  -c1,1 -m30,255 -v1 -q30,30 -e -n0.0 -N0.0 -u -l -p -s2 -g2  -S -Y chr21:1-48129895
For detailed REDItoolDnaRna.py options click here

7) Exclude invariant positions as well as positions not supported by ≥10 WGS reads:

$ awk 'FS="\t" {if ($8!="-" && $10>=10 && $13=="-") print}' editing/DnaRna_892028847/outTable_892028847 > outTable_892028847_chr21.out

8) Annotate positions using RepeatMasker and dbSNP annotations:

$ AnnotateTable.py -a /usr/share/course_data/rnaediting/rptmsk/rmsk_chr21.sorted.gtf.gz -n rmsk -i outTable_892028847_chr21.out -o outTable_892028847_chr21.out.rmsk -u

$ AnnotateTable.py -a /usr/share/course_data/rnaediting/dbsnp/snp151_chr21.sorted.gtf.gz -n snp151 -n snp151 -i outTable_892028847_chr21.out.rmsk -o outTable_892028847_chr21.out.rmsk.snp -u
For detailed AnnotateTable.py options click here

9) Create a first set of positions selecting sites supported by at least five RNAseq reads and a single mismatch:

$ selectPositions.py -i outTable_892028847_chr21.out.rmsk.snp -c 5 -v 1 -f 0.0 -o outTable_892028847_chr21.out.rmsk.snp.sel1
For detailed selectPositions.py options click here

10) Create a second set of positions selecting sites supported by ≥10 RNAseq reads, three mismatches and minimum editing frequency of 0.1: 

$ selectPositions.py -i outTable_892028847_chr21.out.rmsk.snp -c 10 -v 3 -f 0.1  -o outTable_892028847_chr21.out.rmsk.snp.sel2

11) Select ALU sites from the first set of positions:

$ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)=="Alu" && $17=="-" && $8!="-" && $10>=10 && $13=="-") print}' outTable_892028847_chr21.out.rmsk.snp.sel1 > outTable_892028847_chr21.out.rmsk.snp.alu

12) Select REP NON ALU sites from the second set of positions, excluding sites in Simple repeats or Low complexity regions:

$ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)!="Alu" && $15!="-" && $15!="Simple_repeat" && $15!="Low_complexity" && $17=="-" && $8!="-" && $10>=10 && $14<=0.05 && $9>=0.1) print}' outTable_892028847_chr21.out.rmsk.snp.sel2 > outTable_892028847_chr21.out.rmsk.snp.nonalu

13) Select NON REP sites from the second set of positions:

$ awk 'FS="\t" {if ($1!="chrM" && substr($16,1,3)!="Alu" && $15=="-" && $17=="-" && $8!="-" && $10>=10 && $14<=0.05 && $9>=0.1) print}' outTable_892028847_chr21.out.rmsk.snp.sel2 > outTable_892028847_chr21.out.rmsk.snp.nonrep

14) Annotate ALU, REP NON ALU and NON REP sites using known editing events from REDIportal:

$ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R  -c 1 -i outTable_892028847_chr21.out.rmsk.snp.alu -o outTable_892028847_chr21.out.rmsk.snp.alu.ed -u

$ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R  -c 1 -i outTable_892028847_chr21.out.rmsk.snp.nonalu -o outTable_892028847_chr21.out.rmsk.snp.nonalu.ed -u

$ AnnotateTable.py -a /usr/share/course_data/rnaediting/rediportal/atlas.gtf.gz -n ed -k R  -c 1 -i outTable_892028847_chr21.out.rmsk.snp.nonrep -o outTable_892028847_chr21.out.rmsk.snp.nonrep.ed -u

15) Extract known editing events from ALU, REP NON ALU and NON REP sites:

$ mv outTable_892028847_chr21.out.rmsk.snp.alu.ed alu

$ mv outTable_892028847_chr21.out.rmsk.snp.nonalu.ed nonalu

$ mv outTable_892028847_chr21.out.rmsk.snp.nonrep.ed nonrep

$ cat alu nonalu nonrep > alu-nonalu-nonrep

$ awk 'FS="\t" {if ($19=="ed") print}' alu-nonalu-nonrep > knownEditing 

16) Convert editing candidates ($19!="ed" selects novel RNA editing events.) in REP NON ALU and NON REP sites in GFF format for further filtering:

$ cat nonalu nonrep > nonalu-nonrep

$ awk 'FS="\t" {if ($19!="ed") print}' nonalu-nonrep > pos.txt

$TableToGFF.py -i pos.txt -s -t -o pos.gff
For detailed TableToGFF.py options click here

17) Convert editing candidates in ALU sites in GFF format for further filtering:

$ awk 'FS="\t" {if ($19!="ed") print}' alu > posalu.txt

$ TableToGFF.py -i posalu.txt -s -t -o posalu.gff

18) Launch REDItoolDnaRna.py on ALU sites using stringent criteria to recover potential editing candidates:

$ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 5,5 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 1 -n 0.0 -e -T posalu.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -k /usr/share/course_data/rnaediting/hg19ref/nochr -R -o firstalu

19) Launch REDItoolDnaRna.py on REP NON ALU and NON REP sites using stringent criteria to recover RNAseq reads harboring reference mismatches:

$ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 10,10 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 3 -n 0.1 -e -T pos.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -k /usr/share/course_data/rnaediting/hg19ref/nochr --reads -R --addP -o first

20) Launch pblat on RNAseq reads harboring reference mismatches from previous step and select multimapping reads:

$ pblat -t=dna -q=rna -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa first/DnaRna_304977045/outReads_304977045 reads.psl

$ readPsl.py reads.psl badreads.txt

21) Extract RNAseq reads harboring reference mismatches from Step 19 and remove duplicates:
  
$ sort -k1,1 -k2,2n -k3,3n first/DnaRna_595685947/outReads_595685947 | mergeBed > bed 

$ samtools view -@ 4 -L bed -h -b ./RNAseq/SRR1310520_chr21_Aligned.sortedByCoord.out.bam > SRR1310520_chr21_bed.bam

$ samtools sort -@ 4 -n SRR1310520_chr21_bed.bam -o SRR1310520_chr21_bed_ns.bam 

$ samtools fixmate -@ 4 -m SRR1310520_chr21_bed_ns.bam  SRR1310520_chr21_bed_ns_fx.bam 

$ samtools sort -@ 4 SRR1310520_chr21_bed_ns_fx.bam -o SRR1310520_chr21_bed_ns_fx_st.bam

$ samtools markdup -r -@ 4 SRR1310520_chr21_bed_ns_fx_st.bam SRR1310520_chr21_bed_dedup.bam

$ samtools index SRR1310520_chr21_bed_dedup.bam

22) Re-run REDItoolDnaRna.py on REP NON ALU and NON REP sites using stringent criteria, deduplicated reads and mis-mapping info:

$ REDItoolDnaRna.py -s 2 -g 2 -S -t 4 -i SRR1310520_chr21_bed_dedup.bam -f /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -c 10,10 -q 30,30 -m 255,255 -O 5,5 -p -u -a 11-6 -l -v 3 -n 0.1 -e -T pos.sorted.gff.gz -w /usr/share/course_data/rnaediting/Gencode_annotation/gencode.v30lift37.chr21.splicesites.txt -R -k /usr/share/course_data/rnaediting/hg19ref/nochr -b badreads.txt --rmIndels -o second
  
23) Collect filtered ALU, REP NON ALU and NON REP sites:

$ collect_editing_candidates.py

$ sort -k1,1 -k2,2n editing.txt > editing_sorted.txt

24) Inspect the distribution of editing candidates to look at A-to-I enrichment: 

$ get_Statistics.py