The calculation method of AP field #52

Open
xbz17 opened this issue Dec 14, 2024 · 1 comment

Comments

@xbz17

xbz17 commented Dec 14, 2024

Hello,
Thanks for the wonderful tool TRGT. I'm currently using version 1.1.1 with the TR catalog from Platinum Tandem Repeats, and I have some questions about the AP field of the output VCF.
One of the TR results looks like this:
image

sequence:
GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCTCCCCTCATCACCTCCCCAGCCAC

and the plot:
Pasted image 20241214122956

The AP of this TR is "0.145455,0.145455", which is very low compared with most other TRs. The motif ACCC has two repeats in two places, which are separated by a sequence that is long compared with the length of the motif.
I would like to know how the AP is calculated here; 0.145455 looks like the result of 4*4 (ACCC) / 110 (length of the whole sequence). Should the user also be warned in the output that there is a big break in the repetition of this TR? There are other STR results that pick up every part of a long sequence that matches the motif but is not actually "tandem", and these end up with low AP values as well.
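
As a quick check of that arithmetic, here is a small Python sketch. The formula (bases in perfect motif copies divided by allele length) is only the guess stated above, not something taken from the TRGT source; the copy count and the length of 110 are the values quoted in this post.

```python
# Guess being tested: AP ~= (perfect motif copies * motif length) / allele length.
# This is an assumption from the question, not TRGT's documented formula.
motif = "ACCC"
perfect_copies = 4      # two copies in each of two places, as described above
allele_length = 110     # length of the whole sequence as quoted in the post

guess = perfect_copies * len(motif) / allele_length
print(f"{guess:.6f}")   # prints 0.145455
```

The value matches the reported AP to six decimal places, which is why the guess looks plausible, but a match on one record does not confirm the formula.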

@egor-dolzhenko
Collaborator

Thank you for the question. The AP / purity field is meant to indicate how close an allele sequence is to being a perfect repeat composed of the specified motif(s). (The actual algorithm is based on computing the edit distance between the given sequence and the corresponding perfect repeat.) It sounds like your understanding is correct: When the purity is low, the allele sequence contains a small number of perfect motif copies relative to its length. This can occur in several ways: the allele can contain a few perfect motif copies with the rest of the sequence not matching the motif at all; or there could be many imperfect motif copies scattered throughout the repeat sequence. The information about the location of these matches can be found in the MS field (described here). I agree that it would be convenient to add additional output fields that summarize different repeat configurations (especially for low purity repeats like in your example). This is something that we are continuing to work on. Did I answer your question?
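
To make the edit-distance idea concrete, here is a minimal Python sketch. It is not TRGT's implementation: the function names (`edit_distance`, `perfect_repeat`, `toy_purity`), the choice to compare against a perfect repeat of the same length as the allele, the rotation search, and the `1 - distance / length` normalization are all assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (unit-cost substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def perfect_repeat(motif: str, length: int) -> str:
    """Hypothetical helper: the motif tiled and truncated to `length` bases."""
    copies = length // len(motif) + 1
    return (motif * copies)[:length]


def toy_purity(seq: str, motif: str) -> float:
    """Illustrative purity score in [0, 1]; NOT the value TRGT reports as AP."""
    if not seq or not motif:
        raise ValueError("seq and motif must be non-empty")
    # Try every rotation of the motif so the repeat phase does not matter.
    best = min(
        edit_distance(seq, perfect_repeat(motif[k:] + motif[:k], len(seq)))
        for k in range(len(motif))
    )
    return max(0.0, 1.0 - best / len(seq))


print(toy_purity("CAGCAGCAG", "CAG"))  # perfect repeat
print(toy_purity("CAGCATCAG", "CAG"))  # one substituted base
```

Under these assumptions a perfect repeat scores 1.0 and each mismatched or extra base lowers the score, so a long allele with only a few motif copies ends up near the bottom of the range. The exact numbers will not necessarily match the AP values in a TRGT VCF.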
