-
Notifications
You must be signed in to change notification settings - Fork 21
Optimize
For obtaining the best performance it is important that the simulations used in the tuning of the MVA models agree with the dataset that is being analyzed, therefore we added the functionality for the users to easily optimize SigMA parameters for their dataset.
Training of the new model saving it to the local system files of SigMA is currently possible and it is the suggested option.
An example macro that guides the users through steps of training and testing can be found in SigMA/examples/test_tune_example.R
.
Note: Currently only quick simulations are available, while the simulations for built-in SigMA packages were done specifically for libraries of the sequencing platforms used in the sequencing of the datasets. Please follow the instructions below and if you get unexpected results contact the authors. Quick simulations may diminish the performance of the algorithm.
- Create the matrix of mutational spectra using
make_matrix()
function, see Input section here. - Remove samples with hypermutations due to mismatch repair deficiency or POLE-exo mutations. You may already have this annotation from immunohistochemistry. If not run SigMA once and remove these samples and save a new file that contains the mutation matrix. The corresponding lines in the test macro.
- Find the best
data
parameter for therun()
function. See in the test macro.. This is used when running SigMA once to select a tentative Sig3+ sample set. - Generate simulations. You can generate quick simulations by downsampling the WES/WGS mutation matrices saved within the model. See the example. You may want to generate your own simulations using the coverage of a specific sequencing platform. If this is the case make sure to check the distribution average SNV count per sample (
total_snvs
column) in the simulations and data agree. - Tune the new model. You can either use the quick simulations (see in the example) or use your own simulations. You can use your own simulation file for the
simul_file
parameter intune_new_gbm()
function. This minimal amount of information file should contain the mutational spectra in first 96 columns and binary truth tags indicating the presence of a signature (can be Sig3 or other signatures as well). You need to provide the column name of this the tags usingcolname_truth_tag
parameter intune_new_gbm()
function. The default isis_sig3
which is filled in the built in matrices.
- The
tune_new_gbm()
function created an output file that has the same name as the simulated data file with an addition ofpredictions.csv
at the end of the file name See the example.. - Set the desired false positive rate (FPR) thresholds. See here.. You may want to use sensitivity or false discovery rate (FDR) rather than FPR then change
'fpr'
with'sen'
or'fdr'
. - Determine the threshold on the score calculated with gradient boosting that corresponds to those FPR settings in simulated data.See the example..
- Load the trained model (the
.rda
file created bytune_new_gbm()
. - Add the model and the thresholds that correspond to certain FPR values using
add_gbm_model()
function. See the example - You saved a trained gradient boosting classifier in the external datafiles of SigMA (
SigMA/inst/extdata/gbm_models
) which can be accessed byrun()
function. Set thedata = <new_data>
, andcustom= T
and provide a trinucleotide normalization weightnorm96
inrun()
function. The trinucleotide frequency weight can be calculated usingget_trinuc_norm()
function.
You can use built in models and adjust the thresholds applied on the MVA score. The thresholds can be determined for specific sensitivity or false positive rate by generating new simulations that have matching SNV counts as in your dataset.
An example macro to do this calculation can be found in SigMA/examples/test_determine_cutoff.R
. To summarize what is done in the example:
- The best data setting is determined to use in run() function with find_data_setting() function. This simply compares the mutation counts in your dataset to the ones used in training the built in models and returns the closest option.
- Simulations are generated from WGS data using quick_simulations() function
- SigMA is run with the data option determined in step a
- Thresholds for specific FPR and sensitivity settings are determined with get_threshold() function.
- Thresholds are provided to run() function and SigMA is run on the dataset. Columns indicating the presence of Sig3 is generated automatically based by cutting on the
Signature_3_mva
column based on the thresholds fed into the run function.
1. Copy the example to edit it for your own dataset
cd SigMA
cp SigMA/examples/test_determine_cutoff.R test_determine_cutoff_mine.R
2. Edit the input dataset to point to your file with the SigMA input file, to create the input file see lines 1-21 in test.R or test_maf.R.
Replace:
data_dir <- system.file("extdata/matrices/matrices_96dim.rda", package="SigMA")
load(data_dir)
tumor_type <- 'breast'
remove_msi_pole <- F
m <- matrices_96dim[['tcga']][[tumor_type]]
write.table(m, 'tmp.csv', row.names = F, sep = ',', quote = F)
input_file <- 'tmp.csv'
With
input_file <- <path_to_your_local_input>
tumor_type <- tumor_type> # tumor_type for your dataset
remove_msi_pole <- T # F if you are certain there are no mismatch repair deficient or POLE-exo mutated tumors in your dataset
3. Choose a data
option which selects the built in model that will be used
- If you are not sure which data option is the best you can find the suitable starting point by the following lines:
data_val <- find_data_setting(input_file,
tumor_type,
remove_msi_pole = remove_msi_pole)
- OR You can see the available options.
You can set:
data_val <- <your_choice_for_data_setting>
4. Choose false positive rate values or sensitivity values you would like your cutoff to be set at
cut_var <- 'fpr' # you change to 'sen'
fpr_limits <- c(0.1, 0.05) # you can change to e.g. c(0.5, 0.7, 0.9, 1.) for 'sen' setting
5. Run the modified macro
Rscript test_determine_cutoff_mine.R
You will obtain an output file with sensitivity false positive rate and threshold values:
df_sen_fpr_example.csv
and another file with the SigMA results, the name of the output file will be printed on the screen.
If the sensitivity values are too low for reasonable FPR settings or vice versa, this may mean that you need to retune a new classifier rather than simply optimizing the cutoffs for existing models. For this see Option 2 below.
6. To produce the lite file format set one of your new columns in the SigMA output named as pass_mva_<cut_var>_ to be pass_mva_strict
and pass_mva
, for the strict and looser selection thresholds, then use the lite_df() function. In the example macro this is done at the final lines:
if(cut_var == 'fpr' | cut_var == 'fdr'){
strict_limit <- min(limits)
loose_limit <- max(limits)
}else if(cut_var == 'sen'){
strict_limit <- max(limits)
loose_limit <- min(limits)
}else{
stop('cut_var can be sen, fpr or fdr')
}
colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(strict_limit, digit = 2))] <- 'pass_mva_strict'
colnames(df)[colnames(df) == paste0('pass_mva_', cut_var, '_', round(loose_limit, digit = 2))] <- 'pass_mva'
df_lite <- lite_df(df)
The categ
column in the saved lite file indicates summarizes the signature categories of the samples in your dataset.