Parameter choices
You can train a classifier with SigMA without relying on built-in classifiers and for different choices of signature catalogs.
In the run() function, the catalog is set with the catalog_name parameter.
For Sig3 prediction with gradient boosting classifiers, use cosmic_v2_inhouse, which contains the COSMIC v2 catalog plus signatures discovered from WGS data that did not match the catalog.
For MMRD prediction with the predict_mmrd function, use cosmic_v3_inhouse, which contains the COSMIC v3 catalog plus the same additional signatures as cosmic_v2_inhouse.
Other available catalogs that can be used to train new models:
- cosmic_v2: COSMIC catalog v2
- cosmic_v3p2: COSMIC catalog v3.2
- cosmic_v3p2_inhouse: COSMIC catalog v3.2 with added signatures as described in the following preprint.
The sequencing platform is defined using the data parameter of the run() function:
- wgs: for whole-genome sequencing
- tcga_mc3, seqcap or seqcap_probe: for whole-exome sequencing
- msk: for MSK-IMPACT panels; it can also be used as a starting point for training a new model for panels with more than 300 genes.
Three values of the data parameter correspond to whole-exome data: seqcap, seqcap_probe and tcga_mc3. Note that data = tcga_mc3 is not specific to the analysis of TCGA data and can be used for exome data in general. The tcga_mc3 option contains classifiers for the largest number of tumor types. However, two tumor types, ewing and medullo, have no classifier in tcga_mc3 but do have one in seqcap or seqcap_probe. The difference between tcga_mc3 and seqcap or seqcap_probe is the level of subsampling from WGS data, which was determined from the TCGA MC3 mutation calls obtained from the PanCanAtlas database.
You can find which tumor type corresponds to which tumor_type tag of the run() function.
Use the list_tumor_types() function to see the options for the tumor_type parameter of run(); similarly, use list_data_options() to see the available values of the data parameter. To check whether the SNV counts in your data agree with the mutation counts in the datasets used to tune the gradient boosting classifiers, run info(data, tumor_type). If the values disagree, either the cutoffs on the score listed in the Signature_3_mva column of the SigMA output need to be optimized or a new model needs to be tuned.
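The lookups described above can be sketched as follows. This is a minimal illustration that assumes the SigMA package is installed and loaded. The values "msk" and "breast" are placeholders (the "breast" tag in particular is an assumption), so confirm both against the output of the two list functions before relying on them.

```r
library(SigMA)

# Options accepted by the tumor_type parameter of run()
list_tumor_types()

# Options accepted by the data parameter of run()
list_data_options()

# Compare the SNV counts in your samples with the datasets used for tuning.
# "msk" and "breast" are placeholder choices; substitute your own
# data option and tumor_type tag.
info("msk", "breast")
```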
SigMA can be used without an MVA classifier. If no built-in model is available, Sig3 can still be studied by setting do_mva = F and do_assign = F in the run() function. To investigate the presence of Signature 3, set add_sig3 = T. This allows Signature 3 to be assigned to tumors even if the signature was not discovered by NMF in the WGS data for that tumor type. If Signature 3 is already present for that tumor type in the WGS data, no changes are made.
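A call without the MVA classifier might look like the sketch below. It assumes that run() takes the path of the input file as its first argument; the file name is hypothetical, and the data and tumor_type values are placeholders to be replaced with options reported by list_data_options() and list_tumor_types().

```r
library(SigMA)

# Hypothetical path to the input file for run(); replace with your own.
genome_file <- "my_samples_genome_matrix.csv"

# Skip the built-in classifier and the assignment step, but still
# include Signature 3 among the signatures considered.
run(genome_file,
    data = "msk",            # placeholder data option
    tumor_type = "breast",   # placeholder tumor_type tag (assumed)
    do_mva = FALSE,
    do_assign = FALSE,
    add_sig3 = TRUE)
```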
- check_msi: when set to TRUE, includes MMRD-related calculations
- add_sig3: adds Sig3 to the set of signatures considered for that tumor type, together with a Sig3+ expected probability distribution calculated from the Sig3 cluster in breast cancer data. Set it to TRUE to calculate likelihoods in tumor types where Sig3 was not discovered in the analysis of WGS data, which can happen when the WGS dataset is small or Sig3 is rare.
- snv_cutoff: minimum SNV count for which a signature assignment will be made.
See the dedicated page.