BoostDM v2024.07.15-cancer
This boostDM release has been implemented with the output produced by Intogen v2024.06.21. Several improvements and features have been included, including changes in model implementation, reference datasets and bugfixes. Herein we describe the main technical changes, we summarize the new output of the pipeline and provide a comparison with the previous release.
Model Implementation
Uniform cluster features across cohorts of same tumor type category
Although mutational cluster features (OncodriveCLUSTL, HotMAPS and smRegions) are generated on a cohort-by-cohort basis, boostDM models are specified by tumor-type, meaning that an aggregation or consensus step has to be taken in order to annotate the mutations in the training and test sets.
In the previous release, mutations with the same genomic identity (same genomic coordinates and alternate allele) would be annotated with different mutational cluster feature values depending on which cohort they were originally reported in, thereby yielding instances of mutations with same identity but different cluster feature values across samples of the same tumor type.
In the current version we have changed this criterion. Now if a mutation is mapped by a clustering feature in some cohort belonging to a given tumor type, all the instances of the mutation across the tumor type will get the same cluster feature value in the training and prediction corresponding to this tumor type.
Ordinal encoding of cluster features
Each cluster method (OncodriveCLUSTL, HotMAPS and smRegions) is now associated with a single ordinal feature with three levels:
2
- the mutation is associated to a cluster in the same tumor type1
- the mutation is only associated to a cluster in another tumor type0
- the mutation is not associated to the cluster in any case
Therefore we no longer keep the distinction between cat_1
and cat_2
binary-encoded cluster features. Moreover, we do not use the OncodriveCLUSTL_SCORE
numerical feature anymore due to the little gain in predictive ability and explainability along with the added complexity at interpreting the feature contributions at prediction.
Unweighted consensus of base classifiers
The forecast aggregator combining the 50 base classifiers per model is now unweighted, meaning that the base classifiers are no longer favored or penalized based on their cross-validation performance. Particularly in those cases where the cross-validation test set remaining after removing repeated mutation instances is very small, the performance score of a single base classifier can be biased and misleading to determine whether the base classifier behavior needs to be promoted in the forecast aggregator.
Model selection
Our model selection strategy is now conducted in two steps:
- Models are only trained if there are at least 30 mutations on average across the training splits of the 50 base classifiers. Note that the training splits are label-balanced sets comprising 70% of the total mutation training set. Note that the size of the mutation set available for training does in turn depend on the dN/dS excess per consequence type inferred at each cohort.
- We apply a composite rule that requires:
- Mean CV performance F-score50 across base classifier
>= 0.8
- a minimum number of observed mutations per each tier defined by the discovery index
- Mean CV performance F-score50 across base classifier
We have updated step 2 by changing the discovery index tiers and the minimum number of observed mutations for each tier. The new step 2 reads in Python code as follows:
DISCOVERY_TIERS = (0, 0.5, 0.75)
MUTATION_TIERS = (50, 30, 0)
FSCORE_THRESHOLD = 0.8
def meet_condition(fscore, discovery, n_muts):
if fscore >= FSCORE_THRESHOLD:
for discovery_thresh, n_muts_thresh in zip(DISCOVERY_TIERS, MUTATION_TIERS):
if (discovery >= discovery_thresh) and (n_muts >= n_muts_thresh):
return True
return False
The rationale behind this change is that it is very difficult to assert qualitative differences between discovery index values within the (0, 0.5) range, making a case subdivision in this range not well justified. Therefore by providing broader discovery index tiers we make sure to provide more qualitatively distinct discovery classes.
Pan-genes model are no longer trained
We no more train meta-gene models based on the dichotomous mode-of-action classification into oncogenes (Act) and tumor suppressor (LoF) genes. This is in part because the models end up being dominated by a few genes contributing more mutations to each Act or LoF class. The pipeline now only renders models that are gene-specific. Consequently the oncogenic mode-of-action or “role” of the gene as a feature is no longer used. As a result, the column “selected_model_gene” is no longer used in the in silico saturation mutagenesis output tables.
Prediction output format
The in silico saturation mutagenesis output is displayed with some formatting differences, with filenames consistent with the expression [gene].model.[ttype_model].features.[ttype_features].prediction.tsv.gz. The term ttype_model represents the tumor type used as training context, the set of cohorts where the signals of positive selection were calculated and from which the training mutations were taken for the specific model used to cast the prediction, i.e., in which tumor type context the model was learned. The term ttype_features is the tumor type corresponding to the features that are used to encode the input mutation as a vector of feature values for prediction, i.e., the tumor type of the query mutation. In this release, we provide in silico saturation mutagenesis outputs where the predictive model has been trained in the same tumor type context of the query mutation. As a consequence of making explicit the context of the training and the query, the column selected_model_ttype is no longer part of the output tables.
Data Updates
Ensembl Variant Effect Predictor
The new release uses ENSEMBL Variant Effect Predictor (VEP) v.111 annotations instead of ENSEMBL VEP v.92.
Transcripts
Ensembl canonical transcripts are no longer the reference transcripts in order to interpret relevant sites and the consequence type of the mutations. The current release is built on the MANE (Matched Annotation from NCBI and EMBL-EBI) transcripts. We have switched to using MANE Select transcripts to align with common standards used in the clinics. This redefinition of the transcript is in accordance with the current Intogen release v2024.06.212, where it has an impact on the regions used in methods like OncodriveFML, OncodriveCLUSTLand and smRegions, also when computing the mutational profiles and when building coding sequence regions for dNdScv.
Genomic regions of interest
In the current release, the genomic regions considered for training and prediction include the entire coding sequence according to MANE Select, including stop codons (as in the previous release) as well as the 5 noncoding base-pairs flanking each exonic coding sequence segment. The inclusion of the non coding base-pairs intends to exploit the signals of positive selection cast by noncoding splicing-affecting mutations. The mapping is done according to the MANE Select transcripts retrieved from VEP v.111.
Updated consequence type definitions
We have extended the set of sequence ontology (SO) terms that map to the consequence classification we employ in the pipeline, which consists of four consequence types: “missense”, “nonsense”, “splicing” and “synonymous”.
- The term nonsense now includes the SO terms:
stop_gained
stop_lost
start_lost
- The term splicing now include the SO terms:
splice_donor_variant
splice_acceptor_variant
splice_region_variant
splice_donor_5th_base_variant
splice_donor_region_variant
splice_polypyrimidine_tract_variant
intron_variant
Pfam domains
We have updated the version of Pfam to v35.0.
Oncotree
As tumor type ontology for hierarchical model implementation across tumor types we now use the oncotree version boostdm2023 (see Downloads) which is based on the 2021 version of the MSKCC oncotree.
Bug Fixing and technical updates
Fixed reference genome incompatibility
Recently we found a misspecification in the codebase that prevented the randomization of passenger mutations as intended. As described in the main documentation, our method resorts to trinucleotide context probabilities derived from the observed frequencies in the cohort mutational data. With these probabilities we can then draw random mutations to create the set of passenger mutations used as a negative set in the supervised learning step.
Because of the unintended usage of a reference genome inconsistent with the rest of the pipeline in the step that converted genomic coordinates into reference triplets prior to randomization, there was a mismatch between sites and reference triplets. Consequently, the nondriver training set was drawn in a trinucleotide-agnostic way.
In the current version the hg38 genome build is consistently used throughout the entire pipeline.
Intogen integration
The output of Intogen now generates a unified data environment for the boostDM pipeline to be run from the new outputs of Intogen with minimum preprocessing required. For more details, check out the Intogen website.
New pipeline steps
The current version of the pipeline adds two supplemental steps to streamline post-hoc analysis, including model visualization and benchmarking analysis.
Plotting
All the plots used for basic model visualization in the boostDM website are automatically generated as part of a supplemental step in the pipeline. There are three types of plots:
Blueprints.
Stack representation mapping the gene protein coordinates with different annotation layers, from top to bottom: observed mutation frequencies, boostDM score, missense/nonsense driver mutations with high and low confidence tiers, Pfam domains, feature annotations
Clustered blueprints.
Hierarchical clustering representation of several in silico saturation mutagenesis predictions across the collection of tumor types for which we can cast a prediction with the same model tumor type model as features tumor type, i.e. there is a high quality model for the gene in the tumor type context and this model is used to cast the predictions feeding on the feature annotations calculated from this tumor type context.
Discovery index.
Scatter plot and best inverse exponential fitting curve representing the subsampling experiment used in the calculation of the discovery index. The gray dots represent the number of unique mutations in the pool of KRAS mutations of random subsets of tumors taken from the whole tumor type cohort. The inverse exponential fitting bending indicates how close to saturation, i.e. the more the curve is bent, the less likely new mutations are discovered as we sequence more.
Benchmarks
Given the importance for post-hoc evaluation, the boostDM pipeline now includes an expanded version of the benchmarks accompanying the first release of the method (https://doi.org/10.1038/s41586-021-03771-1). We have included both experimental saturation mutagenesis datasets and bioinformatic scores retrieved through dbNSFP. Please, have a look at the full docs.
TL;DR
- Adapt BoostDM pipeline to accept MANE transcript by @FedericaBrando in #4
- New step that generates output plots by @koszulordie in #13
- New step in the pipeline to conduct benchmarking by @koszulordie in #15
- Model Selection: Two-step model selection with updated discovery index tiers and minimum mutation counts; models only trained if training set is adequate.
- Prediction Output Update: Format changes in prediction output files, with explicit tumor type contexts.
- Data Updates:
- VEP Update: Switched to ENSEMBL VEP v.111.
- Transcripts: Transitioned to MANE Select transcripts.
- Regions of Interest: Expanded to include 5 flanking noncoding base-pairs.
- Consequence Types: Updated classification, including more SO terms.
- Pfam & Oncotree: Updated to Pfam v35.0 and oncotree boostdm2023.
Full Changelog: https://github.com/bbglab/boostdm-pipeline/commits/2024.07.15-cancer