
Optimize

Doga C. Gulhan edited this page Nov 13, 2022 · 8 revisions

For the best performance, the simulations used to tune the MVA models should agree with the dataset being analyzed. We therefore added functionality that lets users easily optimize SigMA parameters for their own dataset.

Option 1: Train a new model

Training a new model and saving it to the local system files of SigMA is currently possible, and it is the suggested option.

Example macro

An example macro that guides users through the steps of training and testing can be found in SigMA/examples/test_tune_example.R.

Note: Currently only quick simulations are available, whereas the simulations for the built-in SigMA models were generated specifically for the libraries of the sequencing platforms used to sequence those datasets. Quick simulations may diminish the performance of the algorithm, so please follow the instructions below and contact the authors if you get unexpected results.

Training

  1. Create the matrix of mutational spectra using the make_matrix() function; see the Input section here.
  2. Remove samples that are hypermutated due to mismatch repair deficiency or POLE-exo mutations. You may already have this annotation from immunohistochemistry; if not, run SigMA once, remove these samples, and save a new file containing the mutation matrix. See the corresponding lines in the test macro.
  3. Find the best data parameter for the run() function (see the test macro). This is used when running SigMA once to select a tentative Sig3+ sample set.
  4. Generate simulations. You can generate quick simulations by downsampling the WES/WGS mutation matrices saved within the model (see the example), or you may want to generate your own simulations using the coverage of a specific sequencing platform. In the latter case, make sure that the distribution of the average SNV count per sample (the total_snvs column) agrees between the simulations and the data.
  5. Tune the new model. You can either use the quick simulations (see the example) or your own simulations, passing your simulation file to the simul_file parameter of the tune_new_gbm() function. At minimum, this file should contain the mutational spectra in the first 96 columns and a binary truth tag indicating the presence of a signature (this can be Sig3 or another signature). Provide the column name of the tag with the colname_truth_tag parameter of tune_new_gbm(); the default is is_sig3, which is filled in the built-in matrices.
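The structure of the minimal simulation file described in step 5 can be sketched as follows. This is a base-R illustration with random counts; the 96 column names used by SigMA are not reproduced here (the placeholder names are hypothetical), only the layout matters:

```r
# Sketch of the minimal input expected by tune_new_gbm(): mutational
# spectra in the first 96 columns plus a binary truth column
# (default name: is_sig3). All values below are synthetic.
set.seed(1)

n_samples <- 10
# 96 trinucleotide-context channels; placeholder column names
spectra <- matrix(rpois(n_samples * 96, lambda = 2), nrow = n_samples)
colnames(spectra) <- paste0("channel_", seq_len(96))

simul <- data.frame(spectra)
simul$is_sig3 <- rep(c(1, 0), length.out = n_samples)  # binary truth tag

write.csv(simul, "example_simulations.csv", row.names = FALSE)
```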

Calculating the false positive rate

  1. The tune_new_gbm() function creates an output file that has the same name as the simulated data file with predictions.csv appended at the end of the file name (see the example).
  2. Set the desired false positive rate (FPR) thresholds (see here). If you prefer to use sensitivity or false discovery rate (FDR) rather than FPR, replace 'fpr' with 'sen' or 'fdr'.
  3. Determine the threshold on the score calculated with gradient boosting that corresponds to those FPR settings in the simulated data (see the example).
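The idea behind step 3 can be illustrated with a self-contained base-R sketch on synthetic data: given predictions on simulations with known truth tags, the threshold for a target FPR is the corresponding upper quantile of the scores of the truly negative samples. The column names and score distributions here are assumptions for illustration only:

```r
# Synthetic stand-in for a predictions file: gradient-boosting score
# plus a binary truth tag (hypothetical columns)
set.seed(7)
pred <- data.frame(
  score   = c(runif(500, 0, 0.6),    # negatives tend to score low
              runif(500, 0.4, 1)),   # positives tend to score high
  is_sig3 = rep(c(0, 1), each = 500)
)

# For a target FPR, take the (1 - FPR) quantile of the scores of
# the truly negative samples as the cut-off
threshold_for_fpr <- function(pred, fpr) {
  quantile(pred$score[pred$is_sig3 == 0], probs = 1 - fpr, names = FALSE)
}

thr <- threshold_for_fpr(pred, fpr = 0.05)
achieved_fpr <- mean(pred$score[pred$is_sig3 == 0] > thr)
```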

Adding to system files for future use

  1. Load the trained model (the .rda file created by tune_new_gbm()).
  2. Add the model and the thresholds that correspond to certain FPR values using the add_gbm_model() function (see the example).
  3. This saves the trained gradient boosting classifier in the external data files of SigMA (SigMA/inst/extdata/gbm_models), where it can be accessed by the run() function. Set data = <new_data> and custom = T, and provide a trinucleotide normalization weight norm96 to the run() function. The trinucleotide frequency weight can be calculated with the get_trinuc_norm() function.
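The actual norm96 weight is produced by get_trinuc_norm(), whose implementation is not reproduced here. The sketch below only illustrates, with made-up numbers, the general idea of such a normalization: reweighting counts observed at the panel's trinucleotide composition toward whole-genome abundances:

```r
# Conceptual illustration only -- all numbers are hypothetical and the
# real weight should come from SigMA's get_trinuc_norm()
genome_freq <- c(0.30, 0.25, 0.25, 0.20)  # genome-wide context abundances
panel_freq  <- c(0.20, 0.30, 0.30, 0.20)  # abundances within the panel

norm_weight <- genome_freq / panel_freq   # one weight per context

# Applying the weights rescales panel counts toward what would be
# expected at whole-genome trinucleotide composition
panel_counts    <- c(10, 30, 30, 10)
adjusted_counts <- panel_counts * norm_weight
```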

Option 2: Use an existing classifier but optimize the cut-off

You can use the built-in models and adjust the thresholds applied on the MVA score. The thresholds can be determined for a specific sensitivity or false positive rate by generating new simulations with SNV counts that match your dataset.

An example macro to do this calculation can be found in SigMA/examples/test_determine_cutoff.R. To summarize what is done in the example:

  • The best data setting to use in the run() function is determined with the find_data_setting() function. This simply compares the mutation counts in your dataset to those used in training the built-in models and returns the closest option.
  • Simulations are generated from WGS data using the quick_simulations() function.
  • SigMA is run with the data option determined in the first step.
  • Thresholds for specific FPR and sensitivity settings are determined with the get_threshold() function.
  • The thresholds are provided to the run() function and SigMA is run on the dataset. Columns indicating the presence of Sig3 are generated automatically by cutting on the Signature_3_mva column at the thresholds fed into the run() function.
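The last step, cutting on the Signature_3_mva column, can be sketched in base R with synthetic scores. The thresholds and column-name pattern (pass_mva_<cut_var>_<limit>) follow the conventions used later on this page, but the numeric values here are made up:

```r
# Synthetic SigMA-like output with only the MVA score column
df <- data.frame(Signature_3_mva = c(0.95, 0.40, 0.72, 0.10))

limits     <- c(0.1, 0.05)  # target FPR settings
thresholds <- c(0.6, 0.8)   # hypothetical score cut-offs matching them

# One binary pass column per threshold
for (i in seq_along(limits)) {
  col <- paste0("pass_mva_fpr_", limits[i])
  df[[col]] <- df$Signature_3_mva > thresholds[i]
}
```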

Step by step

1. Copy the example to edit it for your own dataset:

cd SigMA
cp examples/test_determine_cutoff.R test_determine_cutoff_mine.R

2. Edit the input dataset variable to point to your SigMA input file; to create the input file see lines 1-21 in test.R or test_maf.R.

Replace:

data_dir <- system.file("extdata/matrices/matrices_96dim.rda", package="SigMA")
load(data_dir)

tumor_type <- 'breast'
remove_msi_pole <- F

m <- matrices_96dim[['tcga']][[tumor_type]]

write.table(m, 'tmp.csv', row.names = F, sep = ',', quote = F)
input_file <- 'tmp.csv'

With

input_file <- <path_to_your_local_input>
tumor_type <- <tumor_type> # tumor type of your dataset
remove_msi_pole <- T # F if you are certain there are no mismatch repair deficient or POLE-exo mutated tumors in your dataset

3. Choose a data option, which selects the built-in model that will be used

  • If you are not sure which data option is the best, you can find a suitable starting point with the following lines:
data_val <- find_data_setting(input_file, 
  tumor_type,  
  remove_msi_pole = remove_msi_pole)

Alternatively, you can set it directly:

data_val <- <your_choice_for_data_setting>
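The selection made by find_data_setting(), as described above, amounts to picking the built-in option whose typical mutation count is closest to your dataset's. A self-contained base-R sketch of this idea, with hypothetical option names and counts:

```r
# Conceptual stand-in for find_data_setting(): return the candidate
# whose typical average SNV count per sample is closest to the dataset.
# Option names and counts below are hypothetical.
closest_data_setting <- function(avg_snv_in_dataset) {
  typical_counts <- c(panel_a = 10, exome_b = 50, wgs_c = 5000)
  names(typical_counts)[which.min(abs(typical_counts - avg_snv_in_dataset))]
}

setting <- closest_data_setting(avg_snv_in_dataset = 42)
```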

4. Choose the false positive rate or sensitivity values at which you would like your cutoff to be set

cut_var <- 'fpr' # you can change to 'sen'
fpr_limits <- c(0.1, 0.05) # you can change to e.g. c(0.5, 0.7, 0.9, 1.) for the 'sen' setting

5. Run the modified macro

Rscript test_determine_cutoff_mine.R

You will obtain an output file with sensitivity, false positive rate, and threshold values:

df_sen_fpr_example.csv

and another file with the SigMA results; the name of the output file will be printed on the screen.

If the sensitivity values are too low for reasonable FPR settings or vice versa, this may mean that you need to train a new classifier rather than simply optimizing the cutoffs for the existing models. For this see Option 1 above.

6. To produce the lite file format, rename two of the new columns in the SigMA output, named pass_mva_<cut_var>_*, to pass_mva_strict and pass_mva for the strict and looser selection thresholds respectively, then use the lite_df() function. In the example macro this is done in the final lines:

if(cut_var == 'fpr' | cut_var == 'fdr'){
  strict_limit <- min(limits)
  loose_limit <- max(limits)
}else if(cut_var == 'sen'){
  strict_limit <- max(limits)
  loose_limit <- min(limits)
}else{
  stop('cut_var can be sen, fpr or fdr')
}

colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(strict_limit, digits = 2))] <- 'pass_mva_strict'
colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(loose_limit, digits = 2))] <- 'pass_mva'

df_lite <- lite_df(df)

The categ column in the saved lite file summarizes the signature categories of the samples in your dataset.