Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inaccurate copy number #45

Open
oleraj opened this issue Sep 28, 2021 · 1 comment
Open

Inaccurate copy number #45

oleraj opened this issue Sep 28, 2021 · 1 comment

Comments

@oleraj
Copy link

oleraj commented Sep 28, 2021

Hi,

We ran the example dataset with StrA and StrB and then compared the results for case-control to the original reads and it appears the copy number estimate is maybe a little underestimated. Perhaps you can provide some insight into what we're seeing?

Here are the results we got from the case-control analysis:

head example_dataset.casecontrol_locus.tsv
contig	start	end	motif	pvalue	bonf_pvalue	counts
StrA	1723	2123	AGC	0.03854993587177091	0.07709987174354183	sample1:8.461518151815183,sample2:33.28581629341123,sample3:30.048237956676367,sample4:5.298941798941799,sample5:9.576095617529882
StrB	1684	2154	CCG	0.07864960352514261	0.15729920705028522	sample1:22.211485148514853,sample2:22.88399870172022,sample3:31.08438409311348,sample4:29.674074074074078,sample5:20.21620185922975,sample6:21.3226879574185,sample7:20.358141089936474

Here's a screenshot from IGV:
image

If you notice for sample3 in the results above it says there are 30 copies of the AGC motif on StrA. However in IGV, the trimmed repeat portion is 130bp (130S20M CIGAR for left read in the pair), which would correspond to > 40 copies. I would assume the tool should be able to use the maximum length from any given read to determine the copy number. Is that not how it works?

Thanks,

Andrew

@egor-dolzhenko
Copy link
Contributor

Hello Andrew,

Thank you for the question. EHdn performs a relatively simple analysis by identifying and counting reads that are fully contained inside the repeat sequence. Although there is a linear relationship between these counts and the repeat sizes, EHdn does not estimate repeat sizes because it does not know if the expansion occurred on one or both alleles and because it does not know the exact (single-nucleotide resolution) reference coordinates of the repeat.

To summarize, EHdn just determines approximate coordinates of long repeats and produces a rough measure of their length. This information can be used to detect the presence of an expansion but not the exact repeat genotype. In the highlighted example, EHdn found ~30 reads consisting of the AGC motif on StrA.

We are starting to work on the next version of EHdn that will generate much more intuitive results (including the exact repeat sizes).

Relatedly, we recently released a genome-wide repeat catalog compatible with the latest version of (regular) ExpansionHunter. We believe that this catalog contains a good set of targets for studies aimed at discovering new pathogenic / functionally important repeats (though we are still working on a document explaining why we think so).

Please let me know if anything about my answer is unclear or if you have any other questions. If you'd like to discuss the specific dataset you are working with, please feel free to reach out directly by email.

Best wishes,
Egor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants