EPITRAN workshop - RNA Editing 24/06/21 MORNING SESSION

Searching REDIportal is straightforward, and users with no bioinformatics skills can also perform accurate searches across the database. RNA editing sites are stored according to their genomic positions and can be retrieved by providing either a genomic locus (“Genomic Region” field) or a known gene symbol (“Gene Name” field). The two fields are mutually exclusive. Genomic loci can be queried by entering chromosome coordinates in the format chr:start-end (for example chr4:158101247-158308846).
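The chr:start-end format can be checked before pasting a locus into the "Genomic Region" field. The following is a minimal sketch, not part of REDIportal: the function name parse_region and the validation rules (non-empty chromosome name, start not greater than end) are assumptions made for illustration.

```python
import re

# Pattern for the "chr:start-end" form described above.
REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a string such as 'chr4:158101247-158308846' into (chrom, start, end)."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"not in chr:start-end format: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end: {region!r}")
    return match["chrom"], start, end


print(parse_region("chr4:158101247-158308846"))
```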
Note. Today we are going to extract all the editing events related to the genes GRIA2 and FLNA.
The protein encoded by the FLNA gene is an actin-binding protein that is involved in remodeling the cytoskeleton to effect changes in cell shape and migration.
The GRIA2 gene product belongs to a family of glutamate receptors. Human and animal studies suggest that pre-mRNA editing is essential for brain function, and defective GRIA2 RNA editing may be relevant to amyotrophic lateral sclerosis (ALS) etiology.

FLNA and GRIA2 annotations will be adapted to be used with the REDItoolKnown.py script (see below).
This script has been developed to explore the RNA editing potential of RNA-Seq data sets using only known editing events.




RNA editing events in known genes can be retrieved by entering the gene symbol in the “Gene Name” field. To exclude editing sites in the intergenic regions surrounding the entered gene, the “Exact Match” check box must be selected. The “Gene Name” field supports autocompletion to help select the right gene.




Once the genomic region or gene name has been entered, the search can be refined using additional select menus. The following options are available:

Menu Name Option
Location
  • ALU
  • NONREP
  • REP

The Location menu allows the selection of RNA editing sites residing in Alu elements (ALU), non-Alu repetitive elements (REP) or non-repetitive regions (NONREP).

Genic Region
  • 5'UTR
  • 3'UTR
  • Intronic
  • Intergenic
  • Exonic

This menu allows the selection of RNA editing sites residing in specific genic regions: untranslated regions (UTR), intronic regions, coding/non-coding exons or intergenic regions. The classification was carried out with ANNOVAR.

AA Change
  • Synonymous
  • Nonsynonymous
  • Stop Loss
  • Unknown

This menu allows the selection of RNA editing sites that reside in protein-coding regions and affect codon integrity. The classification was carried out with ANNOVAR.

Tissue
  • Adipose Tissue
  • Adrenal Gland
  • Blood
  • ...
  • Thyroid

This menu allows the selection of RNA editing sites found in specific human tissues. More than one tissue can be selected per search. Tissue names follow GTEx.

Body Site
  • Brain - Hypothalamus
  • Brain - Substantia nigra
  • ...
  • Whole Blood

This menu allows the selection of RNA editing sites found in specific human body sites. More than one body site can be selected per search. Body site names follow GTEx.

An example search can be loaded by clicking the "Example" button. All searches are launched by clicking the "Search" button. The search form can be reset by clicking the "Clean" button.

Once a search has been performed, results will be displayed in a table including the following columns:




| Column Name | Meaning |
| --- | --- |
| Chr | Chromosome name |
| Position | Chromosome coordinate |
| Ref | Reference nucleotide |
| Ed | Edited nucleotide |
| Strand | Strand (+ or -) |
| dbSNP | A colored flag indicating the presence of a SNP in dbSNP. Only SNPs classified as "genomic" are taken into account. A green flag indicates a match with dbSNP and also provides an external link to NCBI. |
| Location | Location of the RNA editing site in repetitive or non-repetitive regions. |
| Repeats | Class and family of the repeat that includes the RNA editing position. |
| Gene | Gene symbol |
| Region | Genic region according to ANNOVAR |
| EditedIn | The number of samples in which the position appears to be edited, shown as a progress bar. |
| ExFun | Exonic function, limited to synonymous and non-synonymous positions. A colored flag indicates whether a site is synonymous (green) or non-synonymous (red). Click the flag to open a pop-up with details. |
| Phast | PhastCons conservation score calculated for multiple alignments of 45 vertebrate genomes to the human genome. It ranges from 0 (no conservation) to 1000 (maximum conservation). Values derive from the UCSC phastCons46way table. |
| KnownIn | A colored flag indicating the presence of a site in other available databases (A: ATLAS, R: RADAR, D: DARNED). Click on R or D to open an external link to the RADAR or DARNED database, respectively. |

For each position, REDIportal provides additional information when the blue arrow in the first column is clicked. This opens four tabs. The first tab, named "Heat-Map", displays an RNA editing heat map reporting the mean editing level per body site. Mouse over a body site to open a tooltip showing its average editing level.




The second tab, named "Box Plot", displays the RNA editing levels of each body site as box plots. Relevant values are shown when mousing over a box plot.




The third tab, named "Alternative Annotations", displays a table with gene/transcript annotations from the RefSeq database and the UCSC KnownGene table.




The last tab, named "Editing Details", displays the number of samples, tissues and body sites in which the position appears to be edited. Clicking the "View Editing Details" button opens a new window with a table of editing levels per experiment.




The window opened by the "View Editing Details" button reports the editing information described in the table below.




| Column Name | Meaning |
| --- | --- |
| RNAseq Run | RNAseq run accession number according to the SRA database. |
| WGS Run | Whole Genome Sequencing run accession number according to the SRA database. |
| Tissue | Tissue name according to the GTEx project. |
| BodySite | Body site name according to the GTEx project. |
| n.As | Number of RNAseq reads supporting adenosine |
| n.Gs | Number of RNAseq reads supporting guanosine |
| EditingFreq | RNA editing frequency |
| gCoverage | Number of supporting genomic reads |
| gFreq | Maximum frequency of the A-to-G change at the genomic level. |
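The relation between the n.As, n.Gs and EditingFreq columns can be illustrated with a short sketch. It assumes the usual A-to-I convention, in which the editing frequency is the fraction of G-supporting reads among the A- and G-supporting reads; the exact formula used by REDIportal is not stated here, and the function name editing_frequency is hypothetical.

```python
def editing_frequency(n_as, n_gs):
    """Fraction of G-supporting reads among A + G reads (assumed definition).

    Returns None when no read supports either base, because the
    frequency is undefined in that case.
    """
    total = n_as + n_gs
    if total == 0:
        return None
    return n_gs / total


# 90 reads supporting A and 10 reads supporting G
print(editing_frequency(90, 10))
```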

Users can increase the number of visible rows by using the "Show" button.




In the main result table, specific columns can also be hidden by clicking the "Column visibility" button.




Search results can be downloaded using the "Download" button. This opens a pop-up in which users can select the columns to download.




The columns of each result table can be swapped or moved to customize the table layout and column order.




Columns with gray arrows are sortable in ascending or descending order.




Practical part

After searching for the GRIA2 and FLNA genes, use the "Download" button embedded in the Results table and select the right columns to obtain a tab-separated file compatible with the REDItoolKnown.py script.
See below for further details on this file format.

The main steps described during the practice are reported below and can be easily copy/pasted in your terminal.
Note. Assuming you're traineeX, please change X according to your workspace.
IMPORTANT! The name of the REDItoolKnown.py outTable (e.g. outTable_892028847) contains a 9-digit random number, so it varies among users and between launches of the script on the same machine.
Because multiple versions of the core module pysam are available, some commands may return a pysam error. In that case just type:

$ conda activate rnaediting2

*The rnaediting2 environment contains pysam=0.15.2. Retype the command that returned the error, then revert to your main environment with:

$ conda activate rnaediting

*The rnaediting environment contains pysam==0.7.7

1) Log into your area and create two separate folders for strand-oriented and unstranded RNAseq data (e.g. RNAseq_strnd). Both folders are created directly in your workspace, so no change of directory is needed.

$ mkdir RNAseq_strnd
$ mkdir RNAseq_unstrnd

2) Copy the RNAseq data from the Editing_knwn folder into the corresponding folder

$ cd RNAseq_unstrnd
$ cp /usr/share/course_data/rnaediting/Editing_knwn/Unstrnd/*.bam* .
$ cd ..
$ cd RNAseq_strnd/
$ cp /usr/share/course_data/rnaediting/Editing_knwn/Strndd/*.bam* .
$ cd ..
Note. If you are unable to upload the TAB files from your computer to your home folder, you can recover them from /usr/share/course_data/rnaediting/Editing_knwn/ by entering your workspace and running:
$ cp /usr/share/course_data/rnaediting/Editing_knwn/*.gz* .

3) Launch REDItoolKnown.py on the unstranded RNAseq data and check them ONLY for the known positions previously extracted from REDIportal (e.g. GRIA2)

$ cd RNAseq_unstrnd
$ REDItoolKnown.py -i Cerebellum_unstrnd.bam -f  /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -l ../GRIA2nrptAtlasTable.txt.gz  
$ REDItoolKnown.py -i Lung_unstrnd.bam -f  /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -l ../FLNAAtlasTable.txt.gz
$ cd ..

4) Launch REDItoolKnown.py on the strand-oriented RNAseq data and check them ONLY for the known positions previously extracted from REDIportal (e.g. GRIA2)

$ cd RNAseq_strnd
$ REDItoolKnown.py -i SRR-6H_HD.bam -f  /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -l ../GRIA2nrptAtlasTable.txt.gz -s2 -S
$ REDItoolKnown.py -i SRR-6H_HD.bam -f  /usr/share/course_data/rnaediting/hg19ref/GRCh37.primary_assembly.genome.fa -l ../AGBL4AtlasTable.txt.gz -s2 -S
$ cd ..
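Because the outTable name carries a random number, it can be convenient to locate the tables programmatically after the runs above. The following is a minimal sketch that searches the current folder recursively for files whose names start with outTable_; the helper name find_out_tables is hypothetical, and the exact layout of the output folders is not assumed.

```python
from pathlib import Path


def find_out_tables(run_dir="."):
    """Return all files named outTable_* below run_dir, newest first."""
    tables = [p for p in Path(run_dir).rglob("outTable_*") if p.is_file()]
    # Sort by modification time so the most recent run comes first.
    return sorted(tables, key=lambda p: p.stat().st_mtime, reverse=True)


for table in find_out_tables("."):
    print(table)
```

Running it from inside RNAseq_unstrnd or RNAseq_strnd lists only the tables produced in that folder.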

TAB

TAB files are simple textual files with at least three tabulated columns including:

  • genomic region (generally the chromosome name according to the reference genome)
  • coordinate of the position (1-based)
  • strand (+ or -). You can also indicate strand by 0 (strand -), 1 (strand +) or 2 (+ and - or unknown)
genomic region coordinate strand
chr21 10205589 -
chr21 10205629 -
chr21 15411496 +
chr21 15412990 +
chr21 15414553 +
chr21 15415901 +
chr21 15417667 +
chr21 15423330 +

TAB files must be sorted by coordinate. In a Unix/Linux environment they can be sorted with the sort command:

sort -k1,1 -k2,2n mytable.txt > mytable.sorted.txt
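The same ordering can be produced while building a TAB file from a list of sites. This is a minimal sketch, assuming the sites are already available as (region, coordinate, strand) tuples; the function name sorted_tab_lines is hypothetical. It sorts by region name as text and then by coordinate as a number, mirroring the two sort keys above (the text ordering matches sort only under a byte-wise locale such as LC_ALL=C).

```python
def sorted_tab_lines(sites):
    """Return tab-separated TAB lines sorted by region, then numeric coordinate.

    sites: iterable of (region, coordinate, strand) tuples.
    """
    ordered = sorted(sites, key=lambda site: (site[0], int(site[1])))
    return [f"{region}\t{int(pos)}\t{strand}" for region, pos, strand in ordered]


sites = [
    ("chr21", 15411496, "+"),
    ("chr21", 10205629, "-"),
    ("chr21", 10205589, "-"),
]
print("\n".join(sorted_tab_lines(sites)))
```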

REDItoolKnown.py

REDItoolKnown.py has been developed to explore the RNA editing potential of RNA-Seq data sets using known editing events. Such events can be downloaded from the REDIportal database or collected from the supplementary materials of a variety of publications. Known RNA editing events have to be stored in TAB files (see above for details).

Options:
-i BAM file
-I Sort input BAM file
-f Reference in fasta file
-l List of known RNA editing events
-C Base interval to explore [100000]
-k List of chromosomes to skip separated by comma or file
-t Number of threads [1]
-o Output folder [rediFolder_XXXX] in which all results will be stored. XXXX is a random number generated at each run.
-F Internal folder name [null] is the main folder containing output tables.
-c Min. read coverage [10]
-q Minimum quality score [25]
-m Minimum mapping quality score [25]
-O Minimum homopolymeric length [5]
-s Infer strand (for strand oriented reads) [1]. It indicates which read is in line with RNA. Available values are: 1:read1 as RNA,read2 not as RNA; 2:read1 not as RNA,read2 as RNA; 12:read1 as RNA,read2 as RNA; 0:read1 not as RNA,read2 not as RNA.
-g Strand inference type 1:maxValue 2:useConfidence [1]; maxValue: the most prominent strand count will be used; useConfidence: strand is assigned if over a prefixed frequency confidence (-x option)
-x Strand confidence [0.70]
-S Strand correction. Once the strand has been inferred, only bases according to this strand will be selected.
-G Infer strand by GFF annotation (must be sorted, otherwise use -X). Sorting requires grep and sort unix executables.
-X Sort annotation files. It requires grep and sort unix executables.
-K File with positions to exclude (chromosome_name coordinate)
-e Exclude multi hits
-d Exclude duplicates
-p Use paired concordant reads only
-u Consider mapping quality
-T Trim x bases up and y bases down per read [0-0]
-B Blat folder for correction
-U Remove substitutions in homopolymeric regions
-v Minimum number of reads supporting the variation [3]
-n Minimum editing frequency [0.1]
-E Exclude positions with multiple changes
-P File containing splice sites annotations (SpliceSite file format see above for details)
-r Num. of bases near splice sites to explore [4]
-h Print the help
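The -c, -v and -n thresholds listed above can be read as successive filters applied to each known position. The sketch below only illustrates that reading and is not the REDItools implementation: it assumes the editing frequency is the number of variant reads divided by the coverage and that the comparisons are inclusive; the function name passes_filters is hypothetical, and its defaults copy the bracketed defaults above.

```python
def passes_filters(coverage, variant_reads, min_cov=10, min_var=3, min_freq=0.1):
    """Return True if a position meets the assumed -c, -v and -n thresholds."""
    if coverage <= 0 or coverage < min_cov:
        return False  # too few reads cover the position (-c)
    if variant_reads < min_var:
        return False  # too few reads support the variation (-v)
    # Assumed definition: editing frequency = variant reads / coverage (-n)
    return variant_reads / coverage >= min_freq


print(passes_filters(coverage=20, variant_reads=3))   # frequency 0.15
print(passes_filters(coverage=100, variant_reads=5))  # frequency 0.05
```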

Example:

REDItoolKnown.py -i rnaseq.bam -f reference.fa -l knownEditingSites.tab
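For strand-oriented data, the -g and -x options decide how a strand is assigned from the reads counted on each strand. The following sketch illustrates the two modes as they are described in the option list; it is not taken from the REDItools source. The function name infer_strand, the handling of ties and the inclusive comparison against the confidence are assumptions. The return values follow the numeric strand codes of the TAB format (0: strand -, 1: strand +, 2: unknown).

```python
def infer_strand(plus, minus, mode=1, confidence=0.70):
    """Assign a strand code from per-strand read counts.

    mode 1 (maxValue): the strand with the larger count is used.
    mode 2 (useConfidence): a strand is assigned only if its fraction
    of the reads reaches the confidence threshold (-x).
    """
    total = plus + minus
    if total == 0:
        return 2  # no reads, strand unknown
    if mode == 1:
        if plus == minus:
            return 2  # assumed: a tie is reported as unknown
        return 1 if plus > minus else 0
    if mode == 2:
        if plus / total >= confidence:
            return 1
        if minus / total >= confidence:
            return 0
        return 2
    raise ValueError("mode must be 1 (maxValue) or 2 (useConfidence)")


print(infer_strand(6, 4, mode=1))  # larger count wins
print(infer_strand(6, 4, mode=2))  # 0.6 is below the 0.70 confidence
```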