Understanding orthology scores

Bogdan Kirilenko edited this page Jul 13, 2023 · 4 revisions

Parent page - Interpreting results

For each pair of a reference annotation transcript and an intersecting genome alignment chain, TOGA computes an "orthology score". This numeric feature takes one of three forms: (1) a number between 0.0 and 1.0, (2) -1.0, or (3) -2.0.

The assigned scores can be found in ${toga_output_dir}/temp/orthology_scores.tsv. They are used later to decide what exactly to annotate.
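To inspect the scores programmatically, the file can be loaded as a tab-separated table. The following is a minimal sketch, not part of TOGA itself: it assumes the file starts with a header line and that the columns are named `gene`, `chain`, and `pred`. Those column names and the function `read_orthology_scores` are assumptions made for illustration, so check the header of your own file and adjust the arguments accordingly.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def read_orthology_scores(
    handle: TextIO,
    gene_col: str = "gene",    # assumed column name, verify against your file
    chain_col: str = "chain",  # assumed column name, verify against your file
    score_col: str = "pred",   # assumed column name, verify against your file
) -> Dict[Tuple[str, str], float]:
    """Map each (transcript, chain) pair to its orthology score."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores: Dict[Tuple[str, str], float] = {}
    for row in reader:
        scores[(row[gene_col], row[chain_col])] = float(row[score_col])
    return scores


# Made-up rows for illustration only; this is not real TOGA output.
example = io.StringIO(
    "gene\tchain\tpred\n"
    "TX1\t12\t0.98\n"
    "TX1\t345\t-1.0\n"
    "TX2\t7\t-2.0\n"
)
scores = read_orthology_scores(example)
```

For a real run, pass an open file handle instead of the in-memory example, e.g. `open(path, newline="")` where `path` points to the orthology_scores.tsv file in the TOGA output directory.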

An overview of the possible values is provided below (in progress).

Range from 0.0 to 1.0

Assigned using the XGBoost model. To be filled. This is the normal range: 0.0 is the minimal score (highly unlikely to be an ortholog), and 1.0 is the maximal score (most likely an ortholog).

By default, 0.5 is used as the threshold to differentiate orthologs from paralogs.

Issue related to negative scores

Score == -1

Spanning chains. A score of -1 in the orthology scores file means that the chain is spanning, i.e. it has no alignment to the coding part of the gene. For such cases, the full set of features cannot be computed properly, and the XGBoost model is inapplicable. In 99% of cases, this means that the gene in the respective locus is either missing or deleted. However, TOGA still tries to annotate such cases.

Score == -2

Processed pseudogenes. They have a very specific set of features, so machine learning was not used to identify them.
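The three kinds of values described on this page can be summarized as a small lookup function. This is only an illustrative sketch and not TOGA's actual code: the function name `classify_score` and the category labels are hypothetical, and treating a score exactly equal to the threshold as an ortholog is an assumption, since the page does not say how ties are handled.

```python
SPANNING_SCORE = -1.0              # chain has no alignment to the coding part
PROCESSED_PSEUDOGENE_SCORE = -2.0  # identified without the XGBoost model


def classify_score(score: float, threshold: float = 0.5) -> str:
    """Return a human-readable label for one orthology score."""
    if score == PROCESSED_PSEUDOGENE_SCORE:
        return "processed pseudogene"
    if score == SPANNING_SCORE:
        return "spanning chain"
    if 0.0 <= score <= 1.0:
        # Assumption: a score exactly at the threshold counts as an ortholog.
        return "ortholog" if score >= threshold else "paralog"
    raise ValueError(f"unexpected orthology score: {score}")


labels = [classify_score(s) for s in (0.98, 0.12, -1.0, -2.0)]
```

Checking the two negative sentinel values before the 0.0 to 1.0 range keeps them from ever being compared against the threshold, which only makes sense for scores produced by the model.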