Parameter choices

Doga C. Gulhan edited this page Nov 13, 2022 · 8 revisions

You can train a classifier with SigMA without relying on the built-in classifiers, and with different choices of signature catalog.

Using built-in classifiers

Choice of catalog

In the run() function, the catalog is set with the catalog_name parameter.

For Sig3 prediction with gradient boosting classifiers, use cosmic_v2_inhouse, which contains the COSMIC v2 catalog plus signatures discovered from WGS data that did not match the catalog.

For MMRD prediction with the predict_mmrd function, use cosmic_v3_inhouse, which contains the COSMIC v3 catalog plus the same added signatures as cosmic_v2_inhouse.

Other available catalogs, which can be used to train new models:

  • cosmic_v2: COSMIC catalog v2
  • cosmic_v3p2: COSMIC catalog v3.2
  • cosmic_v3p2_inhouse: COSMIC catalog v3.2 with added signatures as described in the following preprint.

The sequencing method: panel, WES or WGS

The sequencing method is set with the data parameter in the run() function:

  • wgs: for whole-genome sequencing
  • tcga_mc3, seqcap or seqcap_probe: for whole-exome sequencing
  • msk: for MSK-IMPACT panels; it can also serve as a starting point for training a new model for panels with more than 300 genes.
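
As a minimal sketch of how the catalog and data settings above might be combined in one call: the input path is a hypothetical placeholder, passing the input file as the first argument of run() is an assumption not stated on this page, and the seqcap and medullo pairing is only one illustrative choice.

```r
# Hypothetical placeholder: path to your own input file.
genome_file <- "path/to/input_file.csv"

# Settings taken from the options described above.
run_args <- list(
  genome_file,                         # assumption: the input file goes first
  data = "seqcap",                     # a whole-exome sequencing setting
  tumor_type = "medullo",              # tag with a seqcap classifier
  catalog_name = "cosmic_v2_inhouse"   # catalog for Sig3 prediction
)

# Call run() only if SigMA is installed and the input file exists.
if (requireNamespace("SigMA", quietly = TRUE) && file.exists(genome_file)) {
  result <- do.call(SigMA::run, run_args)
}
```

Keeping the settings in a list makes it easy to rerun the same analysis with a different catalog or data value by changing a single element.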

There are three available classifiers for WES data. What is the difference?

The data parameter can be set to three values that correspond to whole-exome data: seqcap, seqcap_probe and tcga_mc3. Note that data = tcga_mc3 is not specific to the analysis of TCGA data and can be used for exome data in general. tcga_mc3 contains classifiers for the largest number of tumor types. The tumor types ewing and medullo, however, have a classifier in seqcap or seqcap_probe but not in tcga_mc3. The difference between tcga_mc3 and seqcap or seqcap_probe is the level of subsampling from WGS data, which was determined from the TCGA MC3 mutation calls obtained from the PanCanAtlas database.

Tumor-type tags

You can find which tumor type corresponds to which tumor_type tag in the run() function.

Which tumor type and data settings are compatible with a built-in multivariate Sig3 classifier?

Use the list_tumor_types() function to see the options for the tumor_type parameter of run(); similarly, use list_data_options() to see the available values of the data parameter. To check whether the SNV counts in your data agree with the mutation counts in the datasets used to tune the gradient boosting classifiers, run info(data, tumor_type). If the values disagree, either the cutoffs on the score listed in the Signature_3_mva column of the SigMA output need to be optimized or a new model needs to be tuned.
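
The checks described above might look like the following sketch. It assumes the three functions are exported from the SigMA namespace and that info() takes the data setting first and the tumor-type tag second, as written on this page; the seqcap and medullo values are illustrative.

```r
# Illustrative settings; replace with the values for your own data.
data_setting <- "seqcap"
tumor_tag <- "medullo"

if (requireNamespace("SigMA", quietly = TRUE)) {
  # Options for the tumor_type parameter of run().
  print(SigMA::list_tumor_types())

  # Options for the data parameter of run().
  print(SigMA::list_data_options())

  # Mutation counts in the datasets used to tune the classifier,
  # for comparison with the SNV counts in your own samples.
  SigMA::info(data_setting, tumor_tag)
}
```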

Using SigMA without Sig3 classifier

SigMA can be used without an MVA classifier. If no built-in model is available, Sig3 can still be studied by setting do_mva = F and do_assign = F in the run() function. To investigate the presence of Signature 3, set add_sig3 = T. This allows Signature 3 to be assigned to tumors even if the signature was not discovered by NMF in the WGS data for that tumor type. If Signature 3 is already present for that tumor type in the WGS data, the setting changes nothing.
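
A minimal sketch of such a run follows. The input path is a hypothetical placeholder, passing the input file as the first argument of run() is an assumption, and whether run() accepts this particular pairing of data and tumor_type has not been verified here. FALSE and TRUE are the spelled-out forms of F and T.

```r
# Hypothetical placeholder: path to your own input file.
genome_file <- "path/to/input_file.csv"

no_mva_args <- list(
  genome_file,            # assumption: the input file goes first
  data = "tcga_mc3",      # illustrative whole-exome setting
  tumor_type = "ewing",   # illustrative tag without a tcga_mc3 classifier
  do_mva = FALSE,         # skip the multivariate classifier
  do_assign = FALSE,      # as described in the text above
  add_sig3 = TRUE         # add Signature 3 to the signatures considered
)

# Call run() only if SigMA is installed and the input file exists.
if (requireNamespace("SigMA", quietly = TRUE) && file.exists(genome_file)) {
  result <- do.call(SigMA::run, no_mva_args)
}
```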

See also

  • check_msi: when set to TRUE, includes MMRD-related calculations.
  • add_sig3: adds Sig3 to the set of signatures considered for that tumor type, together with a Sig3+ expected probability distribution calculated from the Sig3 cluster in breast cancer data. Set it to TRUE to calculate likelihoods in tumor types where Sig3 was not discovered in the analysis of WGS data, which can happen when the WGS dataset is small or Sig3 is rare.
  • snv_cutoff: minimum SNV count for which a signature assignment is made.

Using a custom classifier

See the dedicated page.