Pave annotates a somatic variant VCF with gene and transcript coding and protein effects.
For each impacted transcript it will add the following fields:
Field | Description |
---|---|
Gene | Ensembl gene ID |
GeneName | HGNC gene name |
Transcript | Ensembl transcript ID |
Effects | list of effects separated by '&' |
SpliceRegion | true/false - if variant overlaps with the 8 intronic bases or 3 exonic bases around a splice junction |
HgvsCodingImpact | HGVS coding impact |
HgvsProteinImpact | HGVS protein impact |
For any variant with one or more impacted transcripts, the following summary data is written:
Field | Description |
---|---|
Gene | HGNC gene name |
Transcript | Ensembl canonical transcript ID |
CanonicalEffect | list of effects separated by '&' |
CanonicalCodingEffect | NONE, SPLICE, NONSENSE_OR_FRAMESHIFT, MISSENSE or SYNONYMOUS |
SpliceRegion | true/false - if variant overlaps with the 8 intronic bases or 3 exonic bases around a splice junction |
HgvsCodingImpact | HGVS coding impact |
HgvsProteinImpact | HGVS protein impact |
OtherReportableEffects | Transcript, HGVS Coding, HGVS Protein, Effects, CodingEffect for other reportable transcripts * |
WorstCodingEffect | From all transcripts |
GenesAffected | Count of genes which the variant overlaps |
* if additional reportable transcripts are configured in driver gene panel
Argument | Description |
---|---|
sample | Sample ID |
vcf_file | Input variant VCF |
ref_genome | Reference genome fasta file |
ensembl_data_dir | Path to Ensembl data cache directory |
ref_genome_version | 37 (default) or 38 |
Argument | Description |
---|---|
driver_gene_panel | Driver Gene Panel |
output_dir | Output directory for VCF and transcript CSV, will use input VCF directory if not specified |
output_vcf_file | Specify the output VCF filename |
only_canonical | Only annotate impacts on canonical transcripts |
read_pass_only | Only process passing variants |
threads | Splits variants by chromosome across threads |
write_pass_only | Only write passing variants |
write_transcript_data | Write a detailed TSV file for each impacted transcript |
java -jar pave.jar
-sample SAMPLE_ID
-vcf_file /path_to_somatic_vcf_file/
-ensembl_data_dir /path_to_ensembl_files/
-driver_gene_panel /path_to_gene_panel/
-ref_genome /path_to_ref_genome_fasta/
-ref_genome_version [37 or 38]
-output_dir /path_to_write_data_files/
-threads 8
The following annotations can be applied by PAVE:
Argument | Description | VCF Tag |
---|---|---|
pon_file | PON file to annotate and filter variants (see file format below) | PON_COUNT, PON_MAX |
pon_artefact_file | Panel PON artefact file to annotate and filter variants | PON_PANEL |
pon_filters | Apply PON filters (see details below) | Filter 'PON' |
mappability_bed | BED file with mappability values | MAPPABILITY=value |
clinvar_vcf | VCF from Clinvar database | Writes CLNSIG=significance and CLNSIGCONF=conflicting info. Also affects use of pon_file and gnomad_freq |
gnomad_freq_file | CSV with Gnomad frequencies per variant | GND_FREQ |
gnomad_freq_dir | Path to Gnmoad frequency files per chromosome | GND_FREQ |
blacklist_bed | BED file with blacklist entries | BLACKLIST_BED |
blacklist_vcf | VCF file with blacklist entries | BLACKLIST_VCF |
PAVE predicts the coding impact, protein impact and coding effect of each variant on every overlapping transcript including for up to 1kb upstream. Note that PAVE takes into account nearby mutations in the same local phase set (LPS) where they alter effect (either same codon, or multiple frameshifts leading to a single inframe indel). If a variant exists in multiple LPS then the one with the highest support is used.
The following annotations are added for each affected transcript:
- Gene
- TranscriptId
- HGVSCodingImpact
- HGVSProteinImpact
- Effect
- CodingEffect
- SpliceRegion (T/F - any variant that overlaps within 3 exonic or 8 intronic bases of a splice site)
- Population Frequency (gnomAD)
The effects and codingEffects supported by PAVE are the following:
Effect1 | Coding effect2 |
---|---|
• upstream_gene_variant (<1kb) • intron_variant • 5_prime_UTR_variant • 3_prime_UTR_variant • non_coding_transcript_exon_variant |
NONE |
• synonymous_variant • phased_synonymous4 |
SYNONYMOUS |
• missense_variant • inframe_insertion3 • inframe_deletion3 • phased_inframe_insertion4 • phased_inframe_deletion4 • phased_missense4 |
MISSENSE |
• stop_gained • frameshift • start_lost5 • stop_lost5 |
NONSENSE_OR_FRAMESHIFT |
• splice_donor_variant (D-1,D+1,D+2,D+5) • splice_acceptor_variant (A+1;A+2; A+3 if ALT=G only) |
SPLICE6,7 |
Notes:
- Multiple effects are possible and should both be annotated in the following circumstances:
- An Inframe_insertion, inframe_deletion or frameshift variant may also be stop_gained or stop_lost
- Any deletion or MNV unambiguously overlapping an exon boundary OR an SNV at splice_donor(D-1) will be both splice_acceptor/spice_donor AND one of synonymous_variant, stop_gained, missense_variant, 5_prime_UTR_variant, 3_prime_UTR_variant or non_coding_transcript_exon_variant
- Where a variant has multiple effects, PAVE ranks in the following order for codingEffect:
- NONSENSE_OR_FRAMESHIFT
- SPLICE
- MISSENSE
- SYNONYMOUS
- NONE
- Inframe INDELs may occasionally be annotated as notionally partially or completely outside the coding region due to left alignment and microhomology. Any INDEL with a length that is a multiplier of 3, that can be right aligned to be fully inside the coding regions should be marked as effect=inframe_insertion/inframe_deletion (notable examples include known pathogenic variants in KIT (4:55593579 CAGAAACCCATGTATGAAGTACAGTGGA > C) and EGFR (7:55248980 C > CTCCAGGAAGCCT)).
- Where there are 2 or more frameshift variants with the same LPS (local phase set), if the combined impact causes an inframe indel, then mark both as effect = phased_inframe_deletion / phased_inframe_insertion. If a phased inframe indel and snv or mnv affect the same codon, then mark both as phased_inframe_deletion / phased_inframe_insertion and calculate the combined coding effect (eg. EGFR p.Glu746_Ser752delinsVal). If the net impact is no inserted or deleted bases, then classify the phased variants as phased missense or synonymous.
- Where an INDEL also leads to a stop_lost or start_lost, the lost effects are prioritised
- A SPLICE MNV needs to be marked as splice if any base overlaps a splice site.
- Any INDEL which overlaps a canonical splice region (ie.[D-1:D+5] OR [A+3:A+1]) should be marked as splice_donor/splice_acceptor if and only if the canonical sites are changed according to the SPLICE rules listed above. Where an INDEL has microhomology extends over a splice donor or splice acceptor region, the variant is tested at both the leftmost and rightmost alignment, with intronic only effects prioritised highest, then exonic effects and finally splice effects. A notable recurrent example where D+5 is not affected by an indel with microhomology in GRCH37 are indels at the homopolymer at MSH2 2:47641559. Both splice and frameshift/inframe effects may be reported together if a deletion unambiguously both overlaps coding bases and changes canonical splice sites.
For coding transcripts use c. and for non coding transcripts use n.
Nucleotide numbering conventions:
- No nucleotide 0.
- Nucleotide 1 is the A of the ATG-translation initiation codon (For non coding 1 is the 1st base of the transcript)
- Nucleotide -1 is the 1st base prior to the ATG-translation initiation codon
- Nucleotide *1 is the 1st base 3’ of the translation stop codon
- Use most 3’ position in case of homology for duplications and deletions (with an exception for homology at splice boundary. See https://www.hgvs.org/mutnomen/recs-DNA.html#except)
TERT only is also annotated into the upstream region for 300 bases upstream of the start codon of the canonical transcript due to the widespread conventions around TERT promoter mutations
Examples:
Type | Location | Examples | Notes |
---|---|---|---|
Substitutions (SNV) | Coding | c.76A>C | Includes start and stop codons |
_ | 5’UTR | c.-14G>C | |
_ | Intronic (post) | c.88+1G>T | |
_ | Intronic (pre) | c.89-2A>C | |
_ | 3’UTR | c.*46T>A | |
Deletions | Coding | c.76_78delACT | |
_ | Intronic (pre) | c.726-5537_726-5536delTT | |
_ | 5’UTR Intronic | c.-147-1093delA | |
Duplications | Coding | c.128dupA | Use duplications in case of INS with full homology match. |
_ | Intronic (post) | c.830+11459_830+11461dupGGA | |
_ | Start codon overlap | c.-1_1dupAA | |
Insertions | Coding | c.1033_1034insA | |
_ | Intronic (post) | c.15+1619_15+1620insTTTGTT | |
_ | Intronic (5’UTR) | c.-23-304_-23-303insA | |
Substitutions (MNV) | Coding | c.1390_1391delCGinsTT | |
_ | Intronic | c.853-2260_853-2258delCACinsTAT | |
Complex Indels | Coding | c.112_117delinsTG |
Phased inframe variants should get the combined impact of both variants
Amino acid numbering conventions
- AA 1 = Translation initiator Methionine
- AA *110+1 = 1st AA after the stop codon at AA position 110
- Use most 3’ residue in case of AA homology
Any variant with codingEffect = SPLICE in a coding transcript is given the hvgs protein impact 'p.?'. Other variants are annotated as per below examples:
Type | Context | Examples | Notes |
---|---|---|---|
SYNONYMOUS | Synonymous | p.Leu54= | Snpeff uses p.Leu54Leu but the recommendation has changed (https://www.hgvs.org/mutnomen/disc.html#silent) |
_ | Synonymous (MNV multiple codon) | p.Leu54_Arg55= | |
_ | Synonymous (Stop retained) | p.Ter26= | |
MISSENSE | Missense | p.Trp26Cys | |
_ | Missense (MNV multiple codon) | p.Ala100_Val101delinsArgTrp | SnpEff uses format: p.AlaVal100ArgTrp |
NONSENSE OR FRAMESHIFT | Stop Gained | p.Trp26* | |
_ | Stop Gained (MNV multiple codon - 2nd codon stop) | p.Cys495_Val496delinsArg* | |
_ | Stop Gained (MNV multiple codon - 1st codon stop) | p.Cys495_Val496delins* | |
_ | Frameshift | p.Arg97fs | Always use simply fs even if also stop gained |
_ | Stop Lost | p.Ter407Trpext*? | Ie. a STOP at AA407 changes to a W and extends the protein |
_ | Start Lost | p.Met1? | Any variant that disrupts initiator codon |
INFRAME | Deletion (single AA) | p.Lys2del | |
_ | Deletion (range) | p.Gly4_Gln6del | |
_ | Deletion (non conservative) | p.Cys28_Lys29delinsTrp | |
_ | Duplication (single AA) | p.Gln8dup | |
_ | Duplication (range) | p.Gly4_Gln6dup | |
_ | Insertion | p.Lys2_Leu3insGlnSer | |
_ | Insertion (conservative stop) | p.Ser81_Val82ins* | Ie. a STOP codon is inserted between AA81 and AA82 |
_ | Insertion (non conservative) | p.Cys28delinsTrpVal | |
MIXED | Inframe Deletion with stop lost (single AA) | p.104Terdelext*? | |
_ | Inframe Deletion with stop lost (multiple AA non conservative) | p.Val98_Ter104delinsArgext*? | |
_ | Inframe insertion + stop gained | p.Leu339delinsHisPhe* | Ignore any AA inserted AFTER the stop codon |
Pave can annotate with PON values if the config 'pon_file' is used. The PON file must be a TSV with the following fields:
Chromosome Position Ref Alt SamplesCount MaxSampleReads TotalReads
1 10003 A T 10 30 191
1 10006 C A 16 11 86
1 10007 T G 40 16 234
Pave will then add VCF tags 'PON_COUNT' from SamplesCount and 'PON_MAX' from MaxSampleReads for any matched variant.
If the config 'pon_filters' is used, then Pave will additionally add the filter 'PON' to variants which exceed both the specified SamplesCount and MaxSampleReads values. The filters can be set per variant tier in the form: 'TIER;SAMPLE_COUNT_LIMIT;MAX_READS_LIMIT, for example
HOTSPOT:5:5;PANEL:2:5;UNKNOWN:2:0
will mark variants of tier = HOTSPOT as PON if they match an entry with 5+ SamplesCount and 5+ MaxSampleReads, 2+ and 5+ for a PANEL tier variant, and 2+ for any other tier variant.
Some variants (such as indels in microsatellite contexts) are likely to be artefacts at low VAF, but genuine pathogenic variants at high VAF. As such, we establish a concept of a 'potentially pathogenic' variant as a germline or somatic hotspot, or Clinvar pathogenic/likely_pathogenic variant. Then 'potentially pathogenic' variants are not filtered in this way if mean PON reads (i.e. TotalReads/SamplesCount) < 6 and sample VAF > 8%. If the variant is an indel in a microsatellite context spanning N ref bases, we also require VAF > N%.
Finally, variants that are only infrequently in our PON, are likely pathogenic, and are not associated with a long microsatellite should not be PON filtered. Specifically, 'potentially pathogenic' variants with SamplesCount < 10 (and if an indel, has a repeat count < 4) are never PON filtered.
Pave additionally annotates and filters variants based on a supplied Panel PON Artefact file. Any variant present in this file will get the filter 'PONArtefact' from Pave, except for those satisfying the previously mentioned condition - 'potentially pathogenic' variants with SamplesCount < 10 (and if an indel, has a repeat count < 4)
We annotate the population frequency using gnomAD v3.1.2 for hg38 (merged with gnomAD v2.1.1 liftover for exome regions only) and v2.1.1 exome only for GRCH37. We filter the Gnomad file for variants with at least 1e-5 frequency for exome only and 5e-5 for genome. The VCF tag 'GND_FREQ' will report the frequency. Variants with over 0.00015 cohort frequency will get a 'PONGnomad' filter applied by Pave, unless they are 'potentially pathogenic' in which case we require 0.01 frequency.
If a clinvar VCF is provided, PAVE annotates the clinical signficance of each variant. This is also used to determine whether a variant is 'potentially pathogenic', as described above.
A summary of the PON annotation and filtering currently used in the HMF pipeline is below. First, consult the below table to see if the variant is exempted from any PON filters:
Condition 1 | Condition 2 | Condition 3 | Exempted from |
---|---|---|---|
'Potentially pathogenic' | SamplesCount < 10 | RepeatCount < 4 (for indels) | PON, PON_PANEL_ARTEFACT |
'Potentially pathogenic' | Mean PON reads < 6 | Sample VAF > 8% | PON |
Then for any remaining variants, consult this table:
Filter | Annotations | Source | Filter Thresholds | Ref Genome versions |
---|---|---|---|---|
PON_GNOMAD | GND_FREQ | Gnomad v3 | GND_FREQ >= - 0.01 if 'potentially pathogenic' - 0.00015 otherwise |
38 only |
PON_PANEL_ARTEFACT | Curated FFPE Panel Artefacts*** | PON_PANEL | PON_PANEL {ANY} | 38 only |
PON | PON_COUNT* | PON_MAX** | HMF Cohort | See detailed table below |
* Count germline samples with at least 3 reads and sum of base quality > 30 ** Maximum read support in any one sample *** The FFPE panel artefacts were curated from recurrent variants in the panel regions only of 35 samples run on HMF FFPE tumor only panels.
The filters for the HMF cohort PON depend on the ref genome version and are as follows:
Version | Samples | HOTSPOT | PANEL | OTHER |
---|---|---|---|---|
37 | 1000 | PON_MAX>=5 & PON_COUNT>=10 | PON_MAX>=5 & PON_COUNT>=6 | PON_COUNT>=6 |
38 | 98 | PON_MAX>=5 & PON_COUNT>=5 | PON_MAX>=5 & PON_COUNT>=2 | PON_COUNT>=2 |
- Frameshifts may not always be fully aligned to 3'UTR for HGNC protein annotation
- Where multiple ALTs are included on a single line only the 1st ALT allele will be annotated. A workaround is to split multiallelic lines into multiple records first (eg. the -m none option in bcftools).
- Duplications may sometimes be marked as insertions
- Inserts between the translation/coding start base and the previous base may not be classified correctly and/or have the wrong codon impact due to left alignment