Vignette #3: Annotating microRNAs
The ShortStack option --known_miRNAs allows users to provide a set of small RNA sequences that are used to guide ShortStack's MIRNA locus annotation. For details see the README and Vignette #1: A "complete" run. There are some key points about this:
- --known_miRNAs is NOT a database connection. ShortStack has no capability to query miRBase or any other database to find all previously known microRNAs. ShortStack only sees as "known" what the user provides in the --known_miRNAs file.
- A sequence from --known_miRNAs can match the reference genome and have aligned small RNA-seq reads and still NOT be annotated as a MIRNA locus by ShortStack. ShortStack examines the predicted secondary structure of the putative precursor to see if it conforms to accepted norms (Axtell and Meyers, 2018). ShortStack also checks for the presence of the miRNA/miRNA* duplex in the actual sRNA-seq data with the required precision (Axtell and Meyers, 2018). Both of these checks must be passed, or else ShortStack will not call the locus a MIRNA.
ShortStack does not have any knowledge of MIRNA nomenclature nor of any previous annotations. It names all of its clusters, MIRNAs included, with simple names like "Cluster_7". Users will often want to reconcile annotations from ShortStack with pre-existing knowledge. There are two general approaches:
If --known_miRNAs was used during the ShortStack run, any ShortStack-discovered clusters that overlap with one or more of those sequences are noted in the Results.txt file with entries in the column "Known_miRNAs". This is the final column of Results.txt, and it is marked 'NA' if there are no overlaps with sequences from the --known_miRNAs file. Column 20 of Results.txt is marked 'Y' if ShortStack calls the cluster a MIRNA locus, and 'N' otherwise. Importing Results.txt into R or Excel makes it easy to filter on these columns. One can also just use awk one-liners, as demonstrated below:
# Get all clusters that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA"' Results.txt
# Get all clusters that were called MIRNA by ShortStack and that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA" && $20 == "Y"' Results.txt
# Get all clusters that were NOT called MIRNA by ShortStack and that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA" && $20 == "N"' Results.txt
The names of the matching known_miRNAs will provide a good start on proper naming and annotation of the MIRNA loci.
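The three filters above can also be rolled into a single tally. The following is a minimal sketch, not part of ShortStack: the function name summarize_known_overlaps is hypothetical, and it assumes the layout described above (tab-delimited, one header line, the MIRNA call in column 20, and 'NA' in the final column when nothing overlaps).

```shell
# Hypothetical helper: count clusters that overlap known_miRNAs sequences,
# split by whether ShortStack called them MIRNA.
# Assumptions: tab-delimited input, one header line, column 20 is 'Y' or 'N',
# and the final column is 'NA' when no known sequence overlaps.
summarize_known_overlaps() {
    awk -F'\t' '
        NR == 1 { next }              # skip the header line
        $NF != "NA" {                 # cluster overlaps at least one known sequence
            if ($20 == "Y") called++
            else            not_called++
        }
        END {
            # uninitialized counters print as 0
            printf "known_and_called_MIRNA\t%d\n", called
            printf "known_but_not_called\t%d\n", not_called
        }
    ' "$1"
}
```

Calling summarize_known_overlaps Results.txt prints two tab-separated counts. The second count is the one to inspect by hand, since those clusters failed the structure check, the duplex check, or both.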
A more effective method is to intersect the locations of ShortStack-discovered clusters with the genomic locations of previously known MIRNA loci. This is more effective than intersecting against mature miRNA alignments because a given mature miRNA is often encoded by multiple paralogs in a genome. In contrast, each MIRNA precursor is unique. This approach requires a bed- or gff3-formatted file of previously known MIRNA loci, which has to be provided from some other annotation source. One way to obtain it is to blast the relevant hairpin sequences from miRBase against the reference genome of interest and re-format the top / exact hits into bed or gff3 format.
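As a rough sketch of the re-formatting step, the function below converts tabular BLAST hits to six-column bed. It assumes blastn was run with -outfmt "6 std qlen", which gives the twelve standard tabular columns plus the query length as column 13. The function name blast6_to_bed, and the choice to keep only full-length hits at 100% identity, are illustrative assumptions and not a ShortStack or miRBase convention.

```shell
# Hypothetical helper: convert BLAST tabular hits to BED6.
# Assumed input columns (-outfmt "6 std qlen"):
#   1 qseqid  2 sseqid  3 pident  4 length  5 mismatch  6 gapopen
#   7 qstart  8 qend    9 sstart 10 send   11 evalue   12 bitscore  13 qlen
blast6_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        # keep only hits at 100% identity that span the whole query
        $3 == 100 && $4 == $13 {
            # BLAST reports minus-strand hits with sstart > send
            if ($9 <= $10) { start = $9 - 1;  end = $10; strand = "+" }
            else           { start = $10 - 1; end = $9;  strand = "-" }
            # BED is 0-based, half-open: subtract 1 from the 1-based start
            print $2, start, end, $1, 0, strand
        }' "$1"
}
```

A hairpin sequence that matches the genome perfectly at more than one place yields one bed line per location, so the output may need manual review before it is used as the set of known loci.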
bedtools intersect may be used to find overlaps between ShortStack-called de novo clusters and the previously known hairpins. Note that the bedtools suite is required for ShortStack, so bedtools will already be installed in the ShortStack environment. The command below uses the Results.gff3 file made by ShortStack and the bed or gff3 file that the user provides listing the genomic locations of previously annotated MIRNA hairpins.
bedtools intersect -loj -s -a [annotated_hairpins.gff3/bed] -b Results.gff3
The example command above performs a "left outer join": each entry from [annotated_hairpins.gff3/bed] is listed at least once, and every overlap with entries from Results.gff3 (from ShortStack) is reported. The -s switch requires overlaps to be on the same genomic strand. bedtools intersect is quite flexible, and users should consult the manual for further instructions.
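One possible follow-up, sketched here under stated assumptions, is to list the previously annotated hairpins that ShortStack did not recover. With -loj, bedtools pairs a hairpin that has no overlap with a null B feature whose chromosome field is '.'. The function name unmatched_hairpins is hypothetical, and its second argument is the number of columns in the -a file (6 for a six-column bed file, 9 for gff3), which the caller must supply.

```shell
# Hypothetical helper: print the -a entries that had no overlap in -b.
#   $1: output of `bedtools intersect -loj ...`
#   $2: number of columns in the -a file (e.g. 6 for BED6, 9 for GFF3)
# Assumption: a non-overlapping entry is followed by a null B feature,
# so the first B column (column n + 1) is "."
unmatched_hairpins() {
    awk -F'\t' -v n="$2" 'BEGIN { OFS = "\t" }
        $(n + 1) == "." {
            # rebuild only the original -a columns
            line = $1
            for (i = 2; i <= n; i++) line = line OFS $i
            print line
        }' "$1"
}
```

For example, if the intersect output was saved to a file such as loj.txt (a placeholder name) using a six-column bed file for -a, then unmatched_hairpins loj.txt 6 prints the hairpins with no same-strand ShortStack cluster.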