A tool for simplifying snpEff annotations
- python 2.7
- PyVcf python module
- VCF file annotated with snpEff v4.3+.
simple_sv_annotation.py is designed around the new ANN annotation field rather than the previous EFF field.
usage: simple_sv_annotation [options] vcf
Required arguments
vcf FILE - vcf file annotated with snpEff v4.1g+
Optional arguments
--output/-o FILE - Output file name. Use dash (-) for stdout. Default: <invcf>.simpleann.vcf.
--exonNums/-e FILE - List of custom exon numbers (see Alternate Exon Numbers)
--gene_list/-g FILE - List of genes to prioritise on
--known_fusion_pairs/-k FILE - Comma delimited file with a gene pair on each row representing known fusion pairs
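As an illustration of the known fusion pairs format, the sketch below reads comma-delimited gene pairs into a set. It is not taken from simple_sv_annotation.py: the function names `load_fusion_pairs` and `is_known_fusion` are hypothetical, the gene names are only sample input, and treating a pair as order-independent is an assumption.

```python
def load_fusion_pairs(lines):
    """Read 'GENE_A,GENE_B' rows into a set of unordered gene pairs."""
    pairs = set()
    for line in lines:
        genes = [g.strip() for g in line.strip().split(",")]
        # Skip blank or malformed rows (anything that is not exactly two names).
        if len(genes) != 2 or not all(genes):
            continue
        # Assumption: the order of the two genes does not matter.
        pairs.add(frozenset(genes))
    return pairs


def is_known_fusion(gene_a, gene_b, pairs):
    """Return True if the two genes form a pair listed in the file."""
    return frozenset((gene_a, gene_b)) in pairs


# Sample rows, as they might appear in a known fusion pairs file.
known = load_fusion_pairs(["BCR,ABL1\n", "\n", "EML4,ALK\n"])
```

With real input, `load_fusion_pairs(open(path))` would work the same way, since an open file iterates over its lines.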
This program is distributed under the MIT licence save for the exception below.
The file fusion_pairs.txt provided here is an extract of the file at https://github.com/ndaniel/fusioncatcher/blob/master/bin/generate_known.py and is redistributed here under the GNU GPLv3.
Occasionally the exon numbering scheme provided by snpEff is incorrect. snpEff numbers the exons in a transcript sequentially, but sometimes the accepted exon numbering is not sequential. For example, BRCA1 transcript 1, NM_007294, does not have an exon 4.
simple_sv_annotation.py accepts a BED file in which a user can provide custom
numbering for a particular transcript. If a variant is annotated with a
transcript listed in this file, the exon numbers provided by snpEff are replaced
with the exon numbers in the file. If a transcript is not in the file, then the
snpEff exon numbers are used. Follow the format below, separating each field
with a tab:
chr17 41196311 41197819 NM_007294|24
chr17 41199659 41199720 NM_007294|23
chr17 41201137 41201211 NM_007294|22
chr17 41203079 41203134 NM_007294|21
chr17 41209068 41209152 NM_007294|20
chr17 41215349 41215390 NM_007294|19
chr17 41215890 41215968 NM_007294|18
chr17 41219624 41219712 NM_007294|17
chr17 41222944 41223255 NM_007294|16
chr17 41226347 41226538 NM_007294|15
chr17 41228504 41228631 NM_007294|14
chr17 41234420 41234592 NM_007294|13
chr17 41242960 41243049 NM_007294|12
chr17 41243451 41246877 NM_007294|11
chr17 41247862 41247939 NM_007294|10
chr17 41249260 41249306 NM_007294|9
chr17 41251791 41251897 NM_007294|8
chr17 41256138 41256278 NM_007294|7
chr17 41256884 41256973 NM_007294|6
chr17 41258472 41258550 NM_007294|5
chr17 41267742 41267796 NM_007294|3
chr17 41276033 41276132 NM_007294|2
chr17 41277287 41277500 NM_007294|1
In the fourth column, provide the transcript name followed by a "|" and then the exon number. Note that the transcript version is not used.
You may have additional fields in the BED file; simple_sv_annotation.py will only consider the first four.
Note: currently this list of alternate exons is stored in memory because it is expected to be relatively small. Very large lists of alternate exon numbering may affect performance.
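The alternate exon file described above could be held in memory as a simple lookup table. The sketch below is one way to do that and is not the tool's actual implementation: the function name `load_alternate_exons` is hypothetical, exon numbers are kept as strings, and stripping a trailing `.version` from the transcript name is an assumption based on the note that the transcript version is not used.

```python
from collections import defaultdict


def load_alternate_exons(lines):
    """Map transcript name -> list of (chrom, start, end, exon_number)."""
    exons = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not enough columns to be a usable row
        # Only the first four columns are considered; extras are ignored.
        chrom, start, end, name = fields[:4]
        transcript, exon = name.split("|", 1)
        # Assumption: drop any version suffix (NM_007294.3 -> NM_007294).
        transcript = transcript.split(".")[0]
        exons[transcript].append((chrom, int(start), int(end), exon))
    return dict(exons)


# Two rows from the BRCA1 example, written with explicit tabs.
table = load_alternate_exons([
    "chr17\t41258472\t41258550\tNM_007294|5\n",
    "chr17\t41267742\t41267796\tNM_007294|3\n",
])
```

A dictionary keyed by transcript keeps the "is this transcript in the file?" check cheap, which matches the fallback to snpEff numbering for transcripts that are not listed.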
simple_sv_annotation.py will attempt to simplify interesting and easy SV types
to make the annotation result more interpretable. If you have an additional SV
type that you want to be able to simplify, please email David Jenkins (AZ Email
or BU Email).
- Intergenic SVs
- Intronic SVs
- Whole Exon Loss SVs
- Gene Fusions (can result from BND/DEL/INV/DUP)
Examples of the simplified SV annotations are below.
simple_sv_annotation.py has been tested on annotated vcf output files from the
following SV callers:
Additional SV callers will also work with simple_sv_annotation.py if VCF
specifications are followed and each SV is described with standard SV INFO fields:
- SVTYPE
- MATEID (for SVTYPE=BND)
- END (for whole exon deletions)
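To check whether the output of another caller carries the INFO fields listed above, something like the following could be run on the INFO column of each record. This is a plain-string sketch with hypothetical function names (`parse_info`, `missing_sv_fields`); it does not use PyVcf, and requiring END only for SVTYPE=DEL is an assumption drawn from the "whole exon deletions" note.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def missing_sv_fields(info):
    """Return the names of the standard SV INFO fields a record lacks."""
    fields = parse_info(info)
    if "SVTYPE" not in fields:
        return ["SVTYPE"]
    missing = []
    svtype = fields["SVTYPE"]
    if svtype == "BND" and "MATEID" not in fields:
        missing.append("MATEID")
    # Assumption: END is only needed for deletions.
    if svtype == "DEL" and "END" not in fields:
        missing.append("END")
    return missing


report = missing_sv_fields("END=41258555;SVTYPE=DEL;SVLEN=-88")
```

An empty list means the record has everything this sketch checks for.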
Primary output for simple_sv_annotation.py:
In the default mode, simple_sv_annotation.py will not alter the ANN field
provided by snpEff. Instead, an additional field called SIMPLE_ANN will be added
to the SV call. A SIMPLE_ANN will only be added to variants that can be
simplified; other variants are not altered.
There are six fields in the SIMPLE_ANN tag separated by "|".
- SV type (deletion, duplication, insertion, breakend)
- Annotation (fusion, exon loss, intergenic, intronic)
- Gene name
- The seventh SnpEff field (often transcript name)
- For exon loss variants, deleted exon numbers (Exon5del). For fusions, one of
KNOWN_FUSION, ON_PRIORITY_LIST or NOT_PRIORITISED
- Priority of the event (1 highest, 3 lowest)
example:
before:
chr17 41258467 del_5 ATATACCTTTTGGTTATATCATTCTTACATAAAGGACACTGTGAAGGCCCTTTCTTCTGGTTGAGAAGTTTCAGCATGCAAAATCTATA A . . END=41258555;SVTYPE=DEL;SVLEN=-88;UPSTREAM_PAIR_COUNT=0;DOWNSTREAM_PAIR_COUNT=0;PAIR_COUNT=0;ANN=A|exon_loss_variant&splice_acceptor_variant&splice_donor_variant&splice_region_variant&splice_region_variant&splice_region_variant&splice_region_variant&intron_variant&intron_variant|HIGH|BRCA1|BRCA1|transcript|NM_007294.3|Coding|4/23|c.135-5_212+5delTATAGATTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAGGTATA||||||
after:
chr17 41258467 del_5 ATATACCTTTTGGTTATATCATTCTTACATAAAGGACACTGTGAAGGCCCTTTCTTCTGGTTGAGAAGTTTCAGCATGCAAAATCTATA A . . END=41258555;SVTYPE=DEL;SVLEN=-88;UPSTREAM_PAIR_COUNT=0;DOWNSTREAM_PAIR_COUNT=0;PAIR_COUNT=0;ANN=A|exon_loss_variant&splice_acceptor_variant&splice_donor_variant&splice_region_variant&splice_region_variant&splice_region_variant&splice_region_variant&intron_variant&intron_variant|HIGH|BRCA1|BRCA1|transcript|NM_007294.3|Coding|4/23|c.135-5_212+5delTATAGATTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAGGTATA||||||;SIMPLE_ANN=DEL|EXON_DEL|BRCA1|NM_007294.3|Exon5del
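For downstream filtering, the SIMPLE_ANN tag can be pulled out of the INFO column and split on "|". The sketch below is illustrative only: the function name `parse_simple_ann` and the dictionary keys are hypothetical labels for the six fields described above, it assumes a single SIMPLE_ANN entry per record, and any field absent from the tag is returned as None.

```python
# Hypothetical labels for the six documented SIMPLE_ANN fields, in order.
SIMPLE_ANN_FIELDS = (
    "sv_type", "annotation", "gene", "transcript", "detail", "priority",
)


def parse_simple_ann(info):
    """Return the SIMPLE_ANN fields of an INFO string as a dict, or None."""
    prefix = "SIMPLE_ANN="
    for item in info.split(";"):
        if item.startswith(prefix):
            values = item[len(prefix):].split("|")
            # Pad with None so every label is always present in the result.
            values += [None] * (len(SIMPLE_ANN_FIELDS) - len(values))
            return dict(zip(SIMPLE_ANN_FIELDS, values))
    return None  # the variant was not simplified


# Shortened INFO column based on the "after" record above.
info = ("END=41258555;SVTYPE=DEL;SVLEN=-88;"
        "SIMPLE_ANN=DEL|EXON_DEL|BRCA1|NM_007294.3|Exon5del")
ann = parse_simple_ann(info)
```

Records without a SIMPLE_ANN tag return None, which mirrors the default mode where variants that cannot be simplified are left unaltered.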