Why is detecting and genotyping Short Tandem Repeats (STRs) challenging? #21

Open
CSU-KangHu opened this issue Dec 4, 2023 · 3 comments

Comments

@CSU-KangHu

Hi,
Thank you for developing the excellent TRGT tool. I've read your paper "Resolving the unsolved: Comprehensive assessment of tandem repeats at scale". To gain a better understanding, I've also read several other papers on STR detection and genotyping. However, I'm still confused by the following questions:

  1. TRGT requires a --repeats <REPEATS> BED file that specifies the reference coordinates and structure of the tandem repeats. Since the structure and location of the motifs on the reference genome are already known, what are the challenges in detecting the motifs and their repeat counts in reads? What distinguishes TRGT from existing tools like straglr and RepeatHMM?

  2. How can we evaluate the performance between TRGT and various STR detection and genotyping tools? Are there established and reliable benchmark datasets available for this purpose?

@egor-dolzhenko
Collaborator

Thank you for the questions. In many cases identifying and counting motifs in reads is straightforward. But sometimes it gets more complicated because of mosaicism, sequence composition changes, nested repeats, etc. Different tools resolve these challenges in different ways and may also be designed to profile different kinds of repeats. It would make sense to pick a tool that best aligns with the needs of your project. As for benchmarking, here is a recent paper that proposes a new benchmarking framework designed specifically for tandem repeats. Many groups that work on repeat expansions also sequence some samples with known expansions of the repeats they are interested in and then confirm that their tool of choice can detect them. I hope this response is helpful!

@minghuaxu

Hi @egor-dolzhenko,
Thank you for presenting the TR catalog and variant benchmark files in the 'Benchmarking of small and large variants across tandem repeats' paper.
I am curious whether the Truvari refine method can be used to evaluate TR calling tools such as TRGT and straglr, so that metrics like accuracy, F1 score, and recall can be reported for them.
Since the ground-truth VCF file lacks motif and repeat-count information, the Truvari refine method evaluates TR calls based on sequence similarity, size similarity, and other indicators. Is it possible to evaluate the accuracy of the motifs and repeat counts in a TR call set?
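
To make "sequence similarity" and "size similarity" concrete, here is a minimal sketch of how two allele sequences could be compared at the sequence level. This is not Truvari's implementation: the function names are hypothetical, and difflib.SequenceMatcher is used only as a standard-library stand-in for whatever alignment-based similarity a real benchmark applies.

```python
from difflib import SequenceMatcher


def size_similarity(truth_allele: str, called_allele: str) -> float:
    """Ratio of the shorter allele length to the longer one (1.0 = same size)."""
    longer = max(len(truth_allele), len(called_allele))
    if longer == 0:
        return 1.0
    return min(len(truth_allele), len(called_allele)) / longer


def sequence_similarity(truth_allele: str, called_allele: str) -> float:
    """Rough sequence similarity in [0, 1] from difflib (a stand-in metric)."""
    matcher = SequenceMatcher(None, truth_allele, called_allele, autojunk=False)
    return matcher.ratio()


# Hypothetical alleles: 10 vs 12 copies of a CAG motif.
truth = "CAG" * 10
called = "CAG" * 12
print(size_similarity(truth, called))
print(sequence_similarity(truth, called))
```

Note that neither score uses a motif or a repeat count: both are computed from the allele sequences alone, which is why a benchmark of this kind says nothing directly about motif-level accuracy.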

@egor-dolzhenko
Collaborator

Thank you for the question. Yes, Truvari is a sequence-level benchmark. In my opinion, evaluating the accuracy of motifs and their counts is a much more elusive task. For example, some tools might count only the exact motif copies while other tools might also detect imperfect motifs. Because of this, different tools might produce very different motif counts which would all be "correct". When it comes to resolving motif counts, it might be best to do a project-specific benchmarking study and pick a tool that best fits the needs of the specific project.
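
As an illustration of why motif counts can legitimately differ, here is a small sketch with two hypothetical counting rules applied to the same sequence. Neither function reflects how TRGT, straglr, or any other tool actually counts motifs; the sequence and the one-mismatch tolerance are assumptions chosen for the example, and the tolerant rule ignores insertions and deletions.

```python
def longest_exact_run(seq: str, motif: str) -> int:
    """Number of copies in the longest uninterrupted run of exact motif copies."""
    best = 0
    k = len(motif)
    for start in range(len(seq)):
        copies = 0
        pos = start
        while seq.startswith(motif, pos):
            copies += 1
            pos += k
        best = max(best, copies)
    return best


def tolerant_count(seq: str, motif: str, max_mismatches: int = 1) -> int:
    """Count in-frame motif-sized windows within max_mismatches of the motif.

    Assumes the repeat starts at position 0 and contains no indels.
    """
    k = len(motif)
    count = 0
    for pos in range(0, len(seq) - k + 1, k):
        window = seq[pos:pos + k]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            count += 1
    return count


# Hypothetical repeat with one interrupted copy (CAA) in the middle.
seq = "CAGCAGCAACAGCAG"
print(longest_exact_run(seq, "CAG"))                  # 2
print(tolerant_count(seq, "CAG", max_mismatches=0))   # 4
print(tolerant_count(seq, "CAG", max_mismatches=1))   # 5
```

The same 15-base sequence gives counts of 2, 4, or 5 depending on the rule, and each is defensible under its own definition.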
