Skip to content

Latest commit

 

History

History
272 lines (217 loc) · 17.3 KB

README.md

File metadata and controls

272 lines (217 loc) · 17.3 KB

Pave (Prediction and Annotation of Variant Effects)

Pave annotates a somatic variant VCF with gene and transcript coding and protein effects.

For each impacted transcript it will add the following fields:

Field Description
Gene Ensembl gene ID
GeneName HGNC gene name
Transcript Ensembl transcript ID
Effects list of effects separated by '&'
SpliceRegion true/false - if variant overlaps with the 8 intronic bases or 3 exonic bases around a splice junction
HgvsCodingImpact HGVS coding impact
HgvsProteinImpact HGVS protein impact

For any variant with one or more impacted transcripts, the following summary data is written:

Field Description
Gene HGNC gene name
Transcript Ensembl canonical transcript ID
CanonicalEffect list of effects separated by '&'
CanonicalCodingEffect NONE, SPLICE, NONSENSE_OR_FRAMESHIFT, MISSENSE or SYNONYMOUS
SpliceRegion true/false - if variant overlaps with the 8 intronic bases or 3 exonic bases around a splice junction
HgvsCodingImpact HGVS coding impact
HgvsProteinImpact HGVS protein impact
OtherReportableEffects Transcript, HGVS Coding, HGVS Protein, Effects, CodingEffect for other reportable transcripts *
WorstCodingEffect From all transcripts
GenesAffected Count of genes which the variant overlaps

* if additional reportable transcripts are configured in driver gene panel

Running Pave

Mandatory Arguments

Argument Description
sample Sample ID
vcf_file Input variant VCF
ref_genome Reference genome fasta file
ensembl_data_dir Path to Ensembl data cache directory
ref_genome_version 37 (default) or 38

Optional Arguments

Argument Description
driver_gene_panel Driver Gene Panel
output_dir Output directory for VCF and transcript CSV, will use input VCF directory if not specified
output_vcf_file Specify the output VCF filename
only_canonical Only annotate impacts on canonical transcripts
read_pass_only Only process passing variants
threads Splits variants by chromosome across threads
write_pass_only Only write passing variants
write_transcript_data Write a detailed TSV file for each impacted transcript
java -jar pave.jar 
  -sample SAMPLE_ID
  -vcf_file /path_to_somatic_vcf_file/
  -ensembl_data_dir /path_to_ensembl_files/
  -driver_gene_panel /path_to_gene_panel/
  -ref_genome /path_to_ref_genome_fasta/
  -ref_genome_version [37 or 38] 
  -output_dir /path_to_write_data_files/ 
  -threads 8

Optional Annotations

The following annotations can be applied by PAVE:

Argument Description VCF Tag
pon_file PON file to annotate and filter variants (see file format below) PON_COUNT, PON_MAX
pon_artefact_file Panel PON artefact file to annotate and filter variants PON_PANEL
pon_filters Apply PON filters (see details below) Filter 'PON'
mappability_bed BED file with mappability values MAPPABILITY=value
clinvar_vcf VCF from Clinvar database Writes CLNSIG=significance and CLNSIGCONF=conflicting info. Also affects use of pon_file and gnomad_freq
gnomad_freq_file CSV with Gnomad frequencies per variant GND_FREQ
gnomad_freq_dir Path to Gnmoad frequency files per chromosome GND_FREQ
blacklist_bed BED file with blacklist entries BLACKLIST_BED
blacklist_vcf VCF file with blacklist entries BLACKLIST_VCF

Overview and algorithm

PAVE predicts the coding impact, protein impact and coding effect of each variant on every overlapping transcript including for up to 1kb upstream. Note that PAVE takes into account nearby mutations in the same local phase set (LPS) where they alter effect (either same codon, or multiple frameshifts leading to a single inframe indel). If a variant exists in multiple LPS then the one with the highest support is used.

The following annotations are added for each affected transcript:

  • Gene
  • TranscriptId
  • HGVSCodingImpact
  • HGVSProteinImpact
  • Effect
  • CodingEffect
  • SpliceRegion (T/F - any variant that overlaps within 3 exonic or 8 intronic bases of a splice site)
  • Population Frequency (gnomAD)

Effect and Coding Effect

The effects and codingEffects supported by PAVE are the following:

Effect1 Coding effect2
• upstream_gene_variant (<1kb)
• intron_variant
• 5_prime_UTR_variant
• 3_prime_UTR_variant
• non_coding_transcript_exon_variant
NONE
• synonymous_variant
• phased_synonymous4
SYNONYMOUS
• missense_variant
• inframe_insertion3
• inframe_deletion3
• phased_inframe_insertion4
• phased_inframe_deletion4
• phased_missense4
MISSENSE
• stop_gained
• frameshift
• start_lost5
• stop_lost5
NONSENSE_OR_FRAMESHIFT
• splice_donor_variant (D-1,D+1,D+2,D+5)
• splice_acceptor_variant (A+1;A+2; A+3 if ALT=G only)
SPLICE6,7

Notes:

  1. Multiple effects are possible and should both be annotated in the following circumstances:
    • An Inframe_insertion, inframe_deletion or frameshift variant may also be stop_gained or stop_lost
    • Any deletion or MNV unambiguously overlapping an exon boundary OR an SNV at splice_donor(D-1) will be both splice_acceptor/spice_donor AND one of synonymous_variant, stop_gained, missense_variant, 5_prime_UTR_variant, 3_prime_UTR_variant or non_coding_transcript_exon_variant
  2. Where a variant has multiple effects, PAVE ranks in the following order for codingEffect:
    • NONSENSE_OR_FRAMESHIFT
    • SPLICE
    • MISSENSE
    • SYNONYMOUS
    • NONE
  3. Inframe INDELs may occasionally be annotated as notionally partially or completely outside the coding region due to left alignment and microhomology. Any INDEL with a length that is a multiplier of 3, that can be right aligned to be fully inside the coding regions should be marked as effect=inframe_insertion/inframe_deletion (notable examples include known pathogenic variants in KIT (4:55593579 CAGAAACCCATGTATGAAGTACAGTGGA > C) and EGFR (7:55248980 C > CTCCAGGAAGCCT)).
  4. Where there are 2 or more frameshift variants with the same LPS (local phase set), if the combined impact causes an inframe indel, then mark both as effect = phased_inframe_deletion / phased_inframe_insertion. If a phased inframe indel and snv or mnv affect the same codon, then mark both as phased_inframe_deletion / phased_inframe_insertion and calculate the combined coding effect (eg. EGFR p.Glu746_Ser752delinsVal). If the net impact is no inserted or deleted bases, then classify the phased variants as phased missense or synonymous.
  5. Where an INDEL also leads to a stop_lost or start_lost, the lost effects are prioritised
  6. A SPLICE MNV needs to be marked as splice if any base overlaps a splice site.
  7. Any INDEL which overlaps a canonical splice region (ie.[D-1:D+5] OR [A+3:A+1]) should be marked as splice_donor/splice_acceptor if and only if the canonical sites are changed according to the SPLICE rules listed above. Where an INDEL has microhomology extends over a splice donor or splice acceptor region, the variant is tested at both the leftmost and rightmost alignment, with intronic only effects prioritised highest, then exonic effects and finally splice effects. A notable recurrent example where D+5 is not affected by an indel with microhomology in GRCH37 are indels at the homopolymer at MSH2 2:47641559. Both splice and frameshift/inframe effects may be reported together if a deletion unambiguously both overlaps coding bases and changes canonical splice sites.

HGVS Coding Impact

For coding transcripts use c. and for non coding transcripts use n.

Nucleotide numbering conventions:

  • No nucleotide 0.
  • Nucleotide 1 is the A of the ATG-translation initiation codon (For non coding 1 is the 1st base of the transcript)
  • Nucleotide -1 is the 1st base prior to the ATG-translation initiation codon
  • Nucleotide *1 is the 1st base 3’ of the translation stop codon
  • Use most 3’ position in case of homology for duplications and deletions (with an exception for homology at splice boundary. See https://www.hgvs.org/mutnomen/recs-DNA.html#except)

TERT only is also annotated into the upstream region for 300 bases upstream of the start codon of the canonical transcript due to the widespread conventions around TERT promoter mutations

Examples:

Type Location Examples Notes
Substitutions (SNV) Coding c.76A>C Includes start and stop codons
_ 5’UTR c.-14G>C
_ Intronic (post) c.88+1G>T
_ Intronic (pre) c.89-2A>C
_ 3’UTR c.*46T>A
Deletions Coding c.76_78delACT
_ Intronic (pre) c.726-5537_726-5536delTT
_ 5’UTR Intronic c.-147-1093delA
Duplications Coding c.128dupA Use duplications in case of INS with full homology match.
_ Intronic (post) c.830+11459_830+11461dupGGA
_ Start codon overlap c.-1_1dupAA
Insertions Coding c.1033_1034insA
_ Intronic (post) c.15+1619_15+1620insTTTGTT
_ Intronic (5’UTR) c.-23-304_-23-303insA
Substitutions (MNV) Coding c.1390_1391delCGinsTT
_ Intronic c.853-2260_853-2258delCACinsTAT
Complex Indels Coding c.112_117delinsTG

HGVS Protein Impact

Phased inframe variants should get the combined impact of both variants

Amino acid numbering conventions

  • AA 1 = Translation initiator Methionine
  • AA *110+1 = 1st AA after the stop codon at AA position 110
  • Use most 3’ residue in case of AA homology

Any variant with codingEffect = SPLICE in a coding transcript is given the hvgs protein impact 'p.?'. Other variants are annotated as per below examples:

Type Context Examples Notes
SYNONYMOUS Synonymous p.Leu54= Snpeff uses p.Leu54Leu but the recommendation has changed (https://www.hgvs.org/mutnomen/disc.html#silent)
_ Synonymous (MNV multiple codon) p.Leu54_Arg55=
_ Synonymous (Stop retained) p.Ter26=
MISSENSE Missense p.Trp26Cys
_ Missense (MNV multiple codon) p.Ala100_Val101delinsArgTrp SnpEff uses format: p.AlaVal100ArgTrp
NONSENSE OR FRAMESHIFT Stop Gained p.Trp26*
_ Stop Gained (MNV multiple codon - 2nd codon stop) p.Cys495_Val496delinsArg*
_ Stop Gained (MNV multiple codon - 1st codon stop) p.Cys495_Val496delins*
_ Frameshift p.Arg97fs Always use simply fs even if also stop gained
_ Stop Lost p.Ter407Trpext*? Ie. a STOP at AA407 changes to a W and extends the protein
_ Start Lost p.Met1? Any variant that disrupts initiator codon
INFRAME Deletion (single AA) p.Lys2del
_ Deletion (range) p.Gly4_Gln6del
_ Deletion (non conservative) p.Cys28_Lys29delinsTrp
_ Duplication (single AA) p.Gln8dup
_ Duplication (range) p.Gly4_Gln6dup
_ Insertion p.Lys2_Leu3insGlnSer
_ Insertion (conservative stop) p.Ser81_Val82ins* Ie. a STOP codon is inserted between AA81 and AA82
_ Insertion (non conservative) p.Cys28delinsTrpVal
MIXED Inframe Deletion with stop lost (single AA) p.104Terdelext*?
_ Inframe Deletion with stop lost (multiple AA non conservative) p.Val98_Ter104delinsArgext*?
_ Inframe insertion + stop gained p.Leu339delinsHisPhe* Ignore any AA inserted AFTER the stop codon

PON Annotation and Filtering

Pave can annotate with PON values if the config 'pon_file' is used. The PON file must be a TSV with the following fields:

Chromosome      Position        Ref     Alt     SamplesCount    MaxSampleReads  TotalReads
1       10003   A       T       10      30      191
1       10006   C       A       16      11      86
1       10007   T       G       40      16      234

Pave will then add VCF tags 'PON_COUNT' from SamplesCount and 'PON_MAX' from MaxSampleReads for any matched variant.

If the config 'pon_filters' is used, then Pave will additionally add the filter 'PON' to variants which exceed both the specified SamplesCount and MaxSampleReads values. The filters can be set per variant tier in the form: 'TIER;SAMPLE_COUNT_LIMIT;MAX_READS_LIMIT, for example

HOTSPOT:5:5;PANEL:2:5;UNKNOWN:2:0

will mark variants of tier = HOTSPOT as PON if they match an entry with 5+ SamplesCount and 5+ MaxSampleReads, 2+ and 5+ for a PANEL tier variant, and 2+ for any other tier variant.

Some variants (such as indels in microsatellite contexts) are likely to be artefacts at low VAF, but genuine pathogenic variants at high VAF. As such, we establish a concept of a 'potentially pathogenic' variant as a germline or somatic hotspot, or Clinvar pathogenic/likely_pathogenic variant. Then 'potentially pathogenic' variants are not filtered in this way if mean PON reads (i.e. TotalReads/SamplesCount) < 6 and sample VAF > 8%. If the variant is an indel in a microsatellite context spanning N ref bases, we also require VAF > N%.

Finally, variants that are only infrequently in our PON, are likely pathogenic, and are not associated with a long microsatellite should not be PON filtered. Specifically, 'potentially pathogenic' variants with SamplesCount < 10 (and if an indel, has a repeat count < 4) are never PON filtered.

Artefact PON Annotation and Filtering

Pave additionally annotates and filters variants based on a supplied Panel PON Artefact file. Any variant present in this file will get the filter 'PONArtefact' from Pave, except for those satisfying the previously mentioned condition - 'potentially pathogenic' variants with SamplesCount < 10 (and if an indel, has a repeat count < 4)

GNOMAD Population Frequency

We annotate the population frequency using gnomAD v3.1.2 for hg38 (merged with gnomAD v2.1.1 liftover for exome regions only) and v2.1.1 exome only for GRCH37. We filter the Gnomad file for variants with at least 1e-5 frequency for exome only and 5e-5 for genome. The VCF tag 'GND_FREQ' will report the frequency. Variants with over 0.00015 cohort frequency will get a 'PONGnomad' filter applied by Pave, unless they are 'potentially pathogenic' in which case we require 0.01 frequency.

CLINVAR

If a clinvar VCF is provided, PAVE annotates the clinical signficance of each variant. This is also used to determine whether a variant is 'potentially pathogenic', as described above.

PON settings used in the HMF pipeline

A summary of the PON annotation and filtering currently used in the HMF pipeline is below. First, consult the below table to see if the variant is exempted from any PON filters:

Condition 1 Condition 2 Condition 3 Exempted from
'Potentially pathogenic' SamplesCount < 10 RepeatCount < 4 (for indels) PON, PON_PANEL_ARTEFACT
'Potentially pathogenic' Mean PON reads < 6 Sample VAF > 8% PON

Then for any remaining variants, consult this table:

Filter Annotations Source Filter Thresholds Ref Genome versions
PON_GNOMAD GND_FREQ Gnomad v3 GND_FREQ >=
- 0.01 if 'potentially pathogenic'
- 0.00015 otherwise
38 only
PON_PANEL_ARTEFACT Curated FFPE Panel Artefacts*** PON_PANEL PON_PANEL {ANY} 38 only
PON PON_COUNT* PON_MAX** HMF Cohort See detailed table below

* Count germline samples with at least 3 reads and sum of base quality > 30 ** Maximum read support in any one sample *** The FFPE panel artefacts were curated from recurrent variants in the panel regions only of 35 samples run on HMF FFPE tumor only panels.

The filters for the HMF cohort PON depend on the ref genome version and are as follows:

Version Samples HOTSPOT PANEL OTHER
37 1000 PON_MAX>=5 & PON_COUNT>=10 PON_MAX>=5 & PON_COUNT>=6 PON_COUNT>=6
38 98 PON_MAX>=5 & PON_COUNT>=5 PON_MAX>=5 & PON_COUNT>=2 PON_COUNT>=2

Known Issues

  • Frameshifts may not always be fully aligned to 3'UTR for HGNC protein annotation
  • Where multiple ALTs are included on a single line only the 1st ALT allele will be annotated. A workaround is to split multiallelic lines into multiple records first (eg. the -m none option in bcftools).
  • Duplications may sometimes be marked as insertions
  • Inserts between the translation/coding start base and the previous base may not be classified correctly and/or have the wrong codon impact due to left alignment

Version History and Download Links