
Optimize

Doga C. Gulhan edited this page Apr 14, 2020 · 8 revisions

For the best performance, the simulations used to tune the MVA models must agree with the dataset being analyzed. SigMA therefore provides functionality that lets users easily optimize its parameters for their own dataset.

Does my dataset fit any built in model?

If your dataset agrees with the SNV counts that were used to tune the built in models: ``

If your dataset doesn't fit a built in model

Option 1:

You can use the built in models and adjust the thresholds applied to the MVA score. The thresholds can be determined for a specific sensitivity or false positive rate (FPR) by generating new simulations whose SNV counts match those in your dataset. An example macro that does this calculation can be found in SigMA/examples/test_determine_cutoff.R. To summarize what is done in the example:

a. The best data setting to use in the run() function is determined with the find_data_setting() function. This simply compares the mutation counts in your dataset to the ones used in training the built in models and returns the closest option.
b. Simulations are generated from WGS data using the quick_simulations() function.
c. SigMA is run with the data option determined in step a.
d. Thresholds for specific FPR and sensitivity settings are determined with the get_threshold() function.
e. Thresholds are provided to the run() function and SigMA is run on the dataset. Columns indicating the presence of Sig3 are generated automatically by cutting on the Signature_3_mva column at the thresholds passed to the run() function.
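The idea behind step d can be illustrated with a minimal base R sketch. This is not SigMA's get_threshold() implementation: the helper threshold_for_fpr() and the simulated scores are hypothetical and only show how a cutoff on a score can be chosen so that roughly a target fraction of negative simulations pass.

```r
# Toy illustration only: made-up scores standing in for MVA scores of simulations.
set.seed(1)
score_neg <- rbeta(1000, shape1 = 2, shape2 = 8)  # stand-in for Sig3-negative simulations
score_pos <- rbeta(1000, shape1 = 8, shape2 = 2)  # stand-in for Sig3-positive simulations

# Hypothetical helper (not part of SigMA): the cutoff above which
# approximately a fraction `fpr` of the negative scores fall.
threshold_for_fpr <- function(neg_scores, fpr) {
  unname(quantile(neg_scores, probs = 1 - fpr))
}

target_fpr <- 0.05
thr <- threshold_for_fpr(score_neg, target_fpr)

# Fraction of negatives and positives passing the cutoff
achieved_fpr <- mean(score_neg > thr)
sensitivity  <- mean(score_pos > thr)

cat(sprintf("threshold = %.3f, FPR = %.3f, sensitivity = %.3f\n",
            thr, achieved_fpr, sensitivity))
```

In this toy setup a lower target FPR gives a higher cutoff and therefore a lower sensitivity.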

1. Copy the example to edit it for your own dataset

cd SigMA
cp SigMA/examples/test_determine_cutoff.R test_determine_cutoff_mine.R

2. Edit the input dataset to point to your SigMA input file. To create the input file, see lines 1-21 in test.R or test_maf.R.

Replace:

data_dir <- system.file("extdata/matrices/matrices_96dim.rda", package="SigMA")
load(data_dir)

tumor_type <- 'breast'
remove_msi_pole <- F

m <- matrices_96dim[['tcga']][[tumor_type]]

write.table(m, 'tmp.csv', row.names = F, sep = ',', quote = F)
input_file <- 'tmp.csv'

With

input_file <- <path_to_your_local_input>
tumor_type <- <tumor_type> # tumor type for your dataset
remove_msi_pole <- T # F if you are certain there are no mismatch repair deficient or POLE-exo mutated tumors in your dataset

3. Choose a data option, which selects the built in model that will be used

  • If you are not sure which data option is best, you can find a suitable starting point with the following lines:
data_val <- find_data_setting(input_file, 
  tumor_type,  
  remove_msi_pole = remove_msi_pole)

Alternatively, you can set the data option yourself:

data_val <- <your_choice_for_data_setting>
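As a rough illustration of what "closest option" means, the sketch below picks the setting whose typical SNV count is nearest to that of a dataset. It does not call find_data_setting(). The setting names, reference medians and sample counts are all made up for illustration, and the comparison on a log scale is an assumption, not necessarily what SigMA does.

```r
# Hypothetical reference values: median total SNV count per sample for
# three made-up settings (names and numbers are NOT SigMA's).
reference_median <- c(setting_A = 8, setting_B = 60, setting_C = 4000)

# Made-up total SNV counts per sample in the user's dataset
my_counts <- c(35, 72, 51, 90, 44)

# Compare medians on a log scale (an assumption of this sketch),
# since mutation counts span orders of magnitude.
log_distance <- abs(log10(median(my_counts)) - log10(reference_median))
closest <- names(which.min(log_distance))

print(closest)
```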

4. Choose the false positive rate or sensitivity values at which you would like your cutoffs to be set

cut_var <- 'fpr' # you can change this to 'sen'
fpr_limits <- c(0.1, 0.05) # you can change to e.g. c(0.5, 0.7, 0.9, 1.) for 'sen' setting

5. Run the modified macro

Rscript test_determine_cutoff_mine.R

You will obtain an output file with sensitivity, false positive rate and threshold values:

df_sen_fpr_example.csv

and another file with the SigMA results. The name of this output file will be printed on the screen.

If the sensitivity values are too low for reasonable FPR settings, or vice versa, you may need to tune a new classifier rather than simply optimizing the cutoffs for the existing models. For this, see Option 2 below.
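To make this check concrete, here is a small base R sketch that tabulates threshold and sensitivity for a few FPR limits. It uses made-up, weakly separated scores on an arbitrary scale rather than SigMA output. The column names fpr, threshold and sen are hypothetical and need not match those in the output file.

```r
# Toy scores on an arbitrary scale: positives are only weakly separated
# from negatives, mimicking a model that does not fit the dataset well.
set.seed(2)
neg      <- rnorm(2000, mean = 0)
pos_weak <- rnorm(2000, mean = 0.5)

fpr_limits <- c(0.1, 0.05)

# Cutoff for each FPR limit, taken from the negative score distribution
thresholds <- unname(quantile(neg, probs = 1 - fpr_limits))

# Sensitivity achieved at each cutoff
tab <- data.frame(
  fpr       = fpr_limits,
  threshold = thresholds,
  sen       = sapply(thresholds, function(t) mean(pos_weak > t))
)

print(tab)
```

In a table like this, sensitivities well below one half at FPR limits of 0.1 and 0.05 suggest that adjusting cutoffs alone will not give a useful classifier.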

6. To produce the lite file format, rename two of the new columns in the SigMA output, whose names start with pass_mva_<cut_var>_ followed by the rounded limit value, to pass_mva_strict and pass_mva for the strict and looser selection thresholds, respectively. Then use the lite_df() function. In the example macro this is done in the final lines:

if(cut_var == 'fpr' || cut_var == 'fdr'){
  strict_limit <- min(limits)
  loose_limit <- max(limits)
}else if(cut_var == 'sen'){
  strict_limit <- max(limits)
  loose_limit <- min(limits)
}else{
  stop('cut_var can be sen, fpr or fdr')
}

colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(strict_limit, digits = 2))] <- 'pass_mva_strict'
colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(loose_limit, digits = 2))] <- 'pass_mva'

df_lite <- lite_df(df)

The categ column in the saved lite file summarizes the signature categories of the samples in your dataset.
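The renaming in step 6 can be tried on a toy data frame without running SigMA. In the sketch below the sample names, scores and thresholds are made up, and the tumor column name is hypothetical. Only the pass_mva_ naming pattern and the Signature_3_mva column come from the text above, and lite_df() is not called.

```r
# Toy stand-in for a SigMA output table (values are made up)
df <- data.frame(
  tumor = c("s1", "s2", "s3"),          # hypothetical sample ID column
  Signature_3_mva = c(0.9, 0.4, 0.1)
)

cut_var    <- "fpr"
limits     <- c(0.1, 0.05)
thresholds <- c(0.3, 0.6)  # made-up cutoffs for FPR 0.1 and 0.05

# Build one pass/fail column per limit, named pass_mva_<cut_var>_<limit>
for (i in seq_along(limits)) {
  col <- paste0("pass_mva_", cut_var, "_", round(limits[i], digits = 2))
  df[[col]] <- df$Signature_3_mva > thresholds[i]
}

# For 'fpr', the smaller limit is the stricter selection
strict_limit <- min(limits)
loose_limit  <- max(limits)

colnames(df)[colnames(df) == paste0("pass_mva_", cut_var, "_", round(strict_limit, digits = 2))] <- "pass_mva_strict"
colnames(df)[colnames(df) == paste0("pass_mva_", cut_var, "_", round(loose_limit, digits = 2))] <- "pass_mva"

print(df)
```

Every sample that passes the strict selection here also passes the looser one, which is a quick consistency check worth applying to real output as well.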

Option 2:

You can tune a new model for your dataset, following the tuning tutorial.
