Loss of interruption in HTT #51

Open
mate-ldw opened this issue Dec 11, 2024 · 2 comments

Comments

@mate-ldw

Hello,
I am new to TRGT analysis. We run the analysis pipeline through SMRT Link, which currently has TRGT v0.8 installed.

I am looking at the HTT locus in the VCF and found what looks like an issue, or maybe a misunderstanding on my part.

The vcf looks like this:

GT:AL:ALLR:SD:MC:MS:AP:AM
1/2:93,171:86-109,161-201:144,134:19_12,48_9:0(0-57)_1(57-93),0(0-144)_1(144-171):0.978495,0.988304:0.14,0.15

The first allele has 19 CAG and 12 CCG according to the motif count "MC".
The second allele has 48 CAG and 9 CCG according to the motif count "MC".
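To make this reading of the sample field explicit, here is a minimal Python sketch that splits the FORMAT and sample strings quoted above and decodes MC and MS per allele. The helper name `parse_trgt_sample` and the hard-coded motif order (index 0 = CAG, index 1 = CCG) are assumptions taken from the interpretation in this post, not part of TRGT itself.

```python
from typing import NamedTuple

class Allele(NamedTuple):
    length: int          # AL: allele length in bp
    motif_counts: dict   # MC: motif -> reported copy count
    spans: list          # MS: (motif, start, end) tuples

def parse_trgt_sample(fmt, sample, motifs):
    """Decode AL, MC and MS from one sample column (hypothetical helper)."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = []
    per_allele = zip(fields["AL"].split(","),
                     fields["MC"].split(","),
                     fields["MS"].split(","))
    for al, mc, ms in per_allele:
        counts = {m: int(c) for m, c in zip(motifs, mc.split("_"))}
        spans = []
        for part in ms.split("_"):
            # each part looks like "0(0-57)": motif index, then start-end
            idx, rng = part.rstrip(")").split("(")
            start, end = rng.split("-")
            spans.append((motifs[int(idx)], int(start), int(end)))
        alleles.append(Allele(int(al), counts, spans))
    return alleles

FORMAT = "GT:AL:ALLR:SD:MC:MS:AP:AM"
SAMPLE = ("1/2:93,171:86-109,161-201:144,134:19_12,48_9:"
          "0(0-57)_1(57-93),0(0-144)_1(144-171):0.978495,0.988304:0.14,0.15")
MOTIFS = ["CAG", "CCG"]  # assumed motif order for this locus

alleles = parse_trgt_sample(FORMAT, SAMPLE, MOTIFS)
for a in alleles:
    print(a.length, a.motif_counts, a.spans)
```

Each MS span length divided by 3 equals the corresponding MC count (57 / 3 = 19, 36 / 3 = 12), so in this record MC and MS agree with each other and neither carries a separate segment for the interruption.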

But in the sequences in the vcf I see this:

CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCGCCGCCG,
CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCG

The first allele has 17 CAG, 1 CAA, 1 CAG, 1 CCG, 1 CCA and 10 CCG.
The second allele has 46 CAG, 1 CAA, 1 CAG, 1 CCG, 1 CCA and 7 CCG.
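The manual count above can be reproduced with a short run-length pass over the triplets. This is only a sketch: it assumes the allele starts in frame with the motif, and it rebuilds the two sequences from the composition listed above rather than pasting the full strings. The function name `triplet_runs` is made up for illustration.

```python
from itertools import groupby

def triplet_runs(seq):
    """Split seq into consecutive 3-mers and run-length encode them."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return [(t, len(list(group))) for t, group in groupby(triplets)]

# Alleles rebuilt from the composition described in this post
allele_1 = "CAG" * 17 + "CAA" + "CAG" + "CCG" + "CCA" + "CCG" * 10
allele_2 = "CAG" * 46 + "CAA" + "CAG" + "CCG" + "CCA" + "CCG" * 7

for name, seq in (("allele 1", allele_1), ("allele 2", allele_2)):
    print(name, len(seq), triplet_runs(seq))
```

The lengths come out as 93 and 171 bp, matching the AL field, and the runs make the CAA and CCA units visible as separate entries.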

So the motif counts treat the CAA and CCA interruptions as CAG and CCG copies. I understand that the purity score reflects these interruptions, but in the example shown here (https://github.com/PacificBiosciences/trgt/blob/main/docs/vcf_files.md) the interruption is also visualised and reflected in the motif span (MS).

The TRVZ allele plot with motifs (SVG) does not show the interruption (it shows 19_12,48_9). However, the waterfall plot does show it.

Is it expected that a repeat interruption is not counted separately, especially at the end of this repeat, where the structure is well known? Or has this changed in later versions of TRGT, with the documentation describing the newer behavior?

Many thanks in advance!
Max

@egor-dolzhenko
Collaborator

Hello Max,

Thank you for bringing this up. The current motif counting algorithm in TRGT allows some imperfections in the repeat sequence, which is why CAACAG is counted as two CAGs (one imperfect and one perfect motif copy). Unfortunately, assessment of different repeats involves different rules (sometimes counting imperfections / interruptions and sometimes not), making it difficult to implement repeat segmentation that would be compatible with all assessment strategies. Because of this, it is pretty common to post-process TRGT's repeat sizes (for example, by subtracting two motif counts for HTT alleles) to get the size estimates you are looking for. Does this sound reasonable?
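As a rough illustration of that post-processing step, the sketch below subtracts two from the reported CAG count and cross-checks the result against the uninterrupted leading CAG run in the allele sequence. The function name `adjusted_htt_cag`, the default `offset=2`, and the cross-check are illustrative assumptions, not TRGT functionality. The sequences are rebuilt from the composition described earlier in this thread.

```python
import re

def adjusted_htt_cag(mc_cag, allele_seq, offset=2):
    """Return (adjusted count, pure leading CAG run, whether they agree).

    offset=2 assumes the allele carries the CAACAG unit that was counted
    as two CAG copies; alleles without it need a different rule.
    """
    adjusted = mc_cag - offset
    # length of the uninterrupted CAG run at the start of the allele
    leading = re.match(r"(?:CAG)*", allele_seq)
    pure_run = len(leading.group(0)) // 3
    return adjusted, pure_run, adjusted == pure_run

# Alleles rebuilt from the composition reported in the issue
allele_1 = "CAG" * 17 + "CAA" + "CAG" + "CCG" + "CCA" + "CCG" * 10
allele_2 = "CAG" * 46 + "CAA" + "CAG" + "CCG" + "CCA" + "CCG" * 7

print(adjusted_htt_cag(19, allele_1))
print(adjusted_htt_cag(48, allele_2))
```

For the two alleles in this issue the adjusted counts are 17 and 46, and both agree with the pure CAG run. If the third element comes back `False`, that suggests the allele does not follow the assumed CAACAG structure and the fixed offset should not be trusted for it.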

Also, the upcoming version of TRGT should make motif counting much more flexible (and should eliminate the need for any adjustment of the HTT allele lengths). If you'd like, we can send you a binary of this development version by the end of the week (just send me an email).

Best wishes,
Egor

@mate-ldw
Author

Dear Egor,

Thank you for the quick response and the explanation. This is very interesting and important to consider, as those interruptions, or their loss, can be clinically relevant. This is also true for other repeat expansions, such as in FMR1.

We would be happy to try the development version as soon as it is ready. I will contact you via email.

Best regards,
Max
