Inaccurate copy number #45

oleraj · 2021-09-28T20:29:18Z

Hi,

We ran the example dataset with StrA and StrB and then compared the results for case-control to the original reads and it appears the copy number estimate is maybe a little underestimated. Perhaps you can provide some insight into what we're seeing?

Here are the results we got from the case-control analysis:

head example_dataset.casecontrol_locus.tsv
contig	start	end	motif	pvalue	bonf_pvalue	counts
StrA	1723	2123	AGC	0.03854993587177091	0.07709987174354183	sample1:8.461518151815183,sample2:33.28581629341123,sample3:30.048237956676367,sample4:5.298941798941799,sample5:9.576095617529882
StrB	1684	2154	CCG	0.07864960352514261	0.15729920705028522	sample1:22.211485148514853,sample2:22.88399870172022,sample3:31.08438409311348,sample4:29.674074074074078,sample5:20.21620185922975,sample6:21.3226879574185,sample7:20.358141089936474

Here's a screenshot from IGV:

If you notice for sample3 in the results above it says there are 30 copies of the AGC motif on StrA. However in IGV, the trimmed repeat portion is 130bp (130S20M CIGAR for left read in the pair), which would correspond to > 40 copies. I would assume the tool should be able to use the maximum length from any given read to determine the copy number. Is that not how it works?

Thanks,

Andrew

The text was updated successfully, but these errors were encountered:

egor-dolzhenko · 2021-10-04T05:36:04Z

Hello Andrew,

Thank you for the question. EHdn performs a relatively simple analysis by identifying and counting reads that are fully contained inside the repeat sequence. Although there is a linear relationship between these counts and the repeat sizes, EHdn does not estimate repeat sizes because it does not know if the expansion occurred on one or both alleles and because it does not know the exact (single-nucleotide resolution) reference coordinates of the repeat.

To summarize, EHdn just determines approximate coordinates of long repeats and produces a rough measure of their length. This information can be used to detect the presence of an expansion but not the exact repeat genotype. In the highlighted example, EHdn found ~30 reads consisting of the AGC motif on StrA.

We are starting to work on the next version of EHdn that will generate much more intuitive results (including the exact repeat sizes).

Relatedly, we recently released a genome-wide repeat catalog compatible with the latest version of (regular) ExpansionHunter. We believe that this catalog contains a good set of targets for studies aimed at discovering new pathogenic / functionally important repeats (though we are still working on a document explaining why we think so).

Please let me know if anything about my answer is unclear or if you have any other questions. If you'd like to discuss the specific dataset you are working with, please feel free to reach out directly by email.

Best wishes,
Egor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inaccurate copy number #45

Inaccurate copy number #45

oleraj commented Sep 28, 2021

egor-dolzhenko commented Oct 4, 2021

Inaccurate copy number #45

Inaccurate copy number #45

Comments

oleraj commented Sep 28, 2021

egor-dolzhenko commented Oct 4, 2021