Vignette #3: Annotating microRNAs

Mike Axtell edited this page Sep 16, 2024 · 5 revisions

--known_miRNAs misconceptions

The ShortStack option --known_miRNAs allows users to provide a set of small RNA sequences that are used to guide ShortStack's MIRNA locus annotation. For details see the README and Vignette #1 : A "complete" run. There are some key points about this:

  • --known_miRNAs is NOT a database connection. ShortStack does not have any capability to query miRBase or any other databases to find all previously known microRNAs. ShortStack only sees as "known" what the user provides in the --known_miRNAs file.
  • A sequence from --known_miRNAs can match the reference genome and have aligned small RNA-seq reads and still NOT be annotated as a MIRNA locus by ShortStack. ShortStack examines the predicted secondary structure of the putative precursor to see if it conforms to accepted norms (Axtell and Meyers, 2018). ShortStack also checks the actual sRNA-seq data for the miRNA/miRNA* duplex, which must be present with the required precision (Axtell and Meyers, 2018). Both of these checks must be passed or else ShortStack will not call the locus a MIRNA.

Reconciling annotations

ShortStack does not have any knowledge of MIRNA nomenclature nor of any previous annotations. It names all of its clusters, MIRNAs included, with simple names like "Cluster_7". Users will often want to reconcile annotations from ShortStack with pre-existing knowledge. There are two general approaches:

Use overlaps with known_miRNAs

If --known_miRNAs was used during the ShortStack run, any ShortStack-discovered Clusters that overlap with one or more of those sequences will be noted in the Results.txt file with entries in column "Known_miRNAs". This is the final column of Results.txt, and it is marked 'NA' if there are no overlaps with sequences from the --known_miRNAs file. Column 20 of Results.txt is marked 'Y' if ShortStack called the cluster a MIRNA locus, and 'N' otherwise. Importing Results.txt into R or into Excel allows one to easily filter based on these columns. One can also just use awk one-liners, as demonstrated below:

# Get all clusters that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA"' Results.txt

# Get all clusters that were called MIRNA by ShortStack and that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA" && $20 == "Y"' Results.txt

# Get all clusters that were NOT called MIRNA by ShortStack and that overlap with one or more known_miRNAs sequences
awk -F'\t' '$NF != "NA" && $20 == "N"' Results.txt

The names of the matching known_miRNAs will provide a good start on proper naming and annotation of the MIRNA loci.
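Before looking at individual names, it can help to see how many overlapping clusters fall on each side of the MIRNA call. The following is a minimal sketch that reuses the same two columns as the one-liners above. It assumes Results.txt is tab-delimited with a single header line; the function name summarize_known is hypothetical and not part of ShortStack.

```shell
# Tally clusters that overlap known_miRNAs, split by ShortStack's MIRNA call.
# Assumptions: tab-delimited input with one header line, column 20 holds the
# Y/N MIRNA call, and the final column holds the known_miRNAs overlaps.
# Usage (hypothetical helper): summarize_known Results.txt
summarize_known() {
  awk -F'\t' '
    NR > 1 && $NF != "NA" { n[$20]++ }   # skip header, keep overlapping clusters
    END {
      printf "called_MIRNA\t%d\n", n["Y"]
      printf "not_called_MIRNA\t%d\n", n["N"]
    }' "$1"
}
```

The clusters counted as not_called_MIRNA are the ones that failed the structure or precision checks described above and are usually worth inspecting by hand.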

Use bedtools intersect with gff3/bed of known MIRNA loci

A more effective method to annotate is to intersect locations of ShortStack-discovered Clusters with the genomic locations of previously known MIRNA loci from the genome. This is more effective than intersecting against mature miRNA alignments because mature miRNAs are often encoded by multiple paralogs in a given genome. In contrast, each MIRNA precursor is unique. This requires a bed- or gff3-formatted file of previously known MIRNA loci. This will have to be provided from some other annotation source. One way to accomplish this would be to blast the relevant hairpin sequences from miRBase against the reference genome of interest and re-format the top / exact hits into bed or gff3 format.
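As one possible way to do the re-formatting step, the sketch below converts tabular BLAST output into BED6, keeping only full-length, 100% identical hits. It assumes the search was run with -outfmt "6 std qlen", so the twelve standard columns are followed by the query length in column 13. The function name blast6_to_bed is hypothetical, and the filtering rule is an assumption that users may want to relax.

```shell
# Convert BLAST tabular hits (-outfmt "6 std qlen") to sorted BED6.
# Assumed columns: 1 qseqid, 2 sseqid, 3 pident, 4 length, 9 sstart,
# 10 send, 13 qlen. Minus-strand hits have sstart > send.
# Usage (hypothetical helper): blast6_to_bed hairpin_hits.tsv > annotated_hairpins.bed
blast6_to_bed() {
  awk -F'\t' -v OFS='\t' '
    $3 == 100 && $4 == $13 {             # exact match over the whole hairpin
      if ($9 <= $10) { start = $9 - 1;  end = $10; strand = "+" }
      else           { start = $10 - 1; end = $9;  strand = "-" }
      print $2, start, end, $1, 0, strand   # BED start is 0-based
    }' "$1" | sort -k1,1 -k2,2n
}
```

A hairpin that matches more than one genomic location exactly will produce one BED line per location, so duplicated names in the output are worth reviewing before use.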

bedtools intersect may be used to find overlaps between ShortStack-called de novo clusters and the previously known hairpins. Note that the bedtools suite is required for ShortStack so bedtools will be installed in the ShortStack environment. This uses the Results.gff3 file made by ShortStack and the bed or gff3 file that the user provides listing genomic locations of previously annotated MIRNA hairpins.

bedtools intersect -loj -s -a [annotated_hairpins.gff3/bed] -b Results.gff3 

The example command above will perform a "left outer join" such that each entry from the [annotated_hairpins.gff3/bed] will be listed at least once, and every overlap with entries from Results.gff3 (from ShortStack) is reported. The -s switch requires overlaps to have the same genomic strand. bedtools intersect is quite flexible and users should consult the manual for further instructions.
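The left outer join output can be condensed into one line per known hairpin with a count of overlapping ShortStack clusters. The sketch below is one way to do that under stated assumptions: the -a file was BED6 (hairpin name in column 4, so the Results.gff3 fields begin at column 7), and rows with no overlap carry "." in the first Results.gff3 field. The function name count_overlaps and its optional second argument are hypothetical; check a few lines of real output before relying on it.

```shell
# Count ShortStack overlaps per known hairpin from `bedtools intersect -loj` output.
# Assumptions: -a was BED6, so -b fields start at column 7 (override with $2),
# and a "." in the first -b field marks a hairpin with no overlapping cluster.
# Usage (hypothetical helper): count_overlaps loj_output.tsv
count_overlaps() {
  awk -F'\t' -v OFS='\t' -v b1="${2:-7}" '
    {
      if (!($4 in n)) { n[$4] = 0; order[++k] = $4 }   # remember first-seen order
      if ($b1 != ".") n[$4]++                          # real overlap, not a null row
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$1"
}
```

Hairpins with a count of 0 were not recovered by ShortStack at all, and counts are pooled by name, so paralogous loci that share a name would need a more specific key such as chromosome and start.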