The difference between two genotyper algorithms #50

Open
ywzhang071394 opened this issue Dec 4, 2024 · 6 comments

Comments

@ywzhang071394

Hi,

Thank you for the nice tool! Could you give an introduction to the two genotyper algorithms, "size" and "cluster"?
I looked through your paper and GitHub page but did not find any related information.

Thanks

@pbsena
Contributor

pbsena commented Dec 4, 2024

Hello,

We recommend the cluster genotyper when there are not many repeats to genotype. The cluster genotyper compares all STR sequences with each other and clusters them, whereas the size genotyper splits reads by the size of the STR sequence in each read, maximizing the difference between alleles relative to the difference within them. The cluster genotyper uses more information but is significantly slower, so for WGS applications we recommend size. The size genotyper is described in the TRGT manuscript, whereas the paper describing the cluster genotyper algorithm is still a work in progress.

Hope this helps! Happy to clarify further.
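
As a rough illustration of the size-based idea described above, here is a minimal sketch that splits per-read repeat lengths into two groups by minimizing the within-group spread (which is equivalent to maximizing the between-group spread). This is not TRGT's actual implementation; the function name `split_by_size` and the example lengths are hypothetical.

```python
from statistics import mean


def split_by_size(lengths):
    """Split repeat lengths into two groups (hypothetical helper, not TRGT code).

    Tries every cut point in the sorted lengths and keeps the one with the
    smallest within-group sum of squared deviations.
    """
    xs = sorted(lengths)
    if len(xs) < 2:
        return xs, []
    best = None
    for i in range(1, len(xs)):
        lo, hi = xs[:i], xs[i:]
        within = sum((x - mean(lo)) ** 2 for x in lo) + sum(
            (x - mean(hi)) ** 2 for x in hi
        )
        if best is None or within < best[0]:
            best = (within, lo, hi)
    return best[1], best[2]


# Made-up per-read repeat lengths (bp), for illustration only.
short_group, long_group = split_by_size([25, 24, 25, 23, 359, 358, 361, 25])
print(short_group, long_group)
```

Note that a split like this uses only the lengths, so a few reads with an unusually long measured span form their own group regardless of what sequence they contain.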

@ywzhang071394
Author

ywzhang071394 commented Dec 4, 2024

Thank you so much!
We previously detected a weird repeat expansion event using the 'size' algorithm, as shown below. The alternate allele is simply the sequence that actually follows "AATTTTCTATTTTTATTTTTATTTTT" in the human T2T reference. How can this be classified as an expansion event? Could you help check it?
Thanks!

chr2 74292837 . AATTTTCTATTTTTATTTTTATTTTT AATTTTCTATTTTTATTTTTATTTTTGTAGAGACGAGGTCTTCCTATGTTGTCCAGGCTGGTCTTGAACTCCTGGGCTCAAGCAATCTGCCTGTCTTGGCCTCTCAAAATGTTGGCATTACAGGCATGAGCCACTGTGCCTAGCCCTTATTCCTTATTTCTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGTCGGACTGCGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTTTTTGTATTTTTA 0 . TRID=chr2_74292837_74292862_trsolve;END=74292862;MOTIFS=TATTTT;STRUC=(TATTTT)n GT:AL:ALLR:SD:MC:MS:AP:AM 0/1:25,359:23-25,358-361:70,7:4,14:0(0-25),0(0-24)_0(146-184)_0(337-359):0.88,0.189415:.,0.69

and we cannot find any supporting reads in IGV.
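
To read the record above more easily, the sketch below splits the FORMAT and sample columns (copied verbatim from the record) into per-allele values. The field meanings in the comments (AL = allele length, SD = spanning read depth, MC = motif count, AP = allele purity) follow my understanding of TRGT's VCF description and should be checked against its documentation; `parse_sample` is a hypothetical helper.

```python
FORMAT = "GT:AL:ALLR:SD:MC:MS:AP:AM"
SAMPLE = (
    "0/1:25,359:23-25,358-361:70,7:4,14:"
    "0(0-25),0(0-24)_0(146-184)_0(337-359):0.88,0.189415:.,0.69"
)


def parse_sample(fmt, sample):
    """Return one dict per allele with AL, SD, MC (int) and AP (float)."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    al = [int(v) for v in fields["AL"].split(",")]    # allele length (bp)
    sd = [int(v) for v in fields["SD"].split(",")]    # spanning read depth
    mc = [int(v) for v in fields["MC"].split(",")]    # motif copies
    ap = [float(v) for v in fields["AP"].split(",")]  # allele purity
    return [
        {"AL": a, "SD": s, "MC": m, "AP": p}
        for a, s, m, p in zip(al, sd, mc, ap)
    ]


alleles = parse_sample(FORMAT, SAMPLE)
for allele in alleles:
    print(allele)
```

Read this way, the 359 bp allele is supported by 7 spanning reads versus 70 for the 25 bp allele, and its purity is about 0.19, so most of that allele is not made of the TATTTT motif.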

[Image: IGV screenshot]

My command is:
/software/trgt genotype -g /data/human/t2t_chm13/chm13v2.0.fa -r /Longread/t2t/${i}_Tumor/04_phase/tumor_haplotagSV.bam -b /data/human/t2t_chm13/chm13v2.0_maskedY_rCRS.platinumTRs-v1.0.trgt.bed -k XY --output-prefix ${path}/Tumor_cluster -t 10

@pbsena
Contributor

pbsena commented Dec 5, 2024

Hello,

Without looking at the reads, my guess is that, because this region is repetitive, some reads aligned to the repetitive sequence downstream of this repeat when TRGT looked for the flanking bases. Maybe try increasing the --flank-len parameter to, say, 500?

@ywzhang071394
Author

Thank you for the suggestion. I tested the cluster algorithm and found that this false-positive event was absent. It seems that the cluster algorithm improves repeat expansion detection a lot and is not as time-consuming as I expected.
Thanks a lot!

@egor-dolzhenko
Collaborator

@ywzhang071394, thank you for letting us know. Would you be open to sharing a waterfall plot of this repeat with us? We could do it by email if you prefer.

@ywzhang071394
Author

Hi,

Sorry, I need to reopen this issue because I want to debug the size algorithm. I tried increasing the --flank-len parameter to 500, but the result did not improve. Could you comment on this?

@ywzhang071394 ywzhang071394 reopened this Dec 12, 2024