Difficulty understanding the output from EHDN #48

sagnikbanerjee15 · 2021-11-15T23:25:02Z

Hello,

I ran EHDN on a CRAM file of paired-end reads of length 150 with the parameters --min-anchor-mapq 0 and --max-irr-mapq 50. While analyzing the locus.tsv file, I came upon the entry chr9 27573523 27573524 CCCCCGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGG 1 0.92 2. The length of the motif is 66, which is less than 150, but I was under the impression that EHDN reports only repeats that are longer than the read length. The AnchoredIrrCount for this repeat was 1 and the IrrPairCount was 0.

Also, could you please explain what the column het_str_size represents?

Thank you.

egor-dolzhenko · 2021-11-18T14:29:09Z

Thanks for the questions! You are looking into the supplementary TSV file generated by the "profile" command, right? If so, the fourth column says that the repeat unit (motif) is CCCCCGCCCCGGCCCCGGCCCCGGCCCCGGC... and the fifth column says that EHDN identified just a single read with that motif. This information suggests that the repeat size is close to the read length.

The het_str_size is a very approximate repeat size estimate, made under the assumption that the long repeat occurs on only one allele. In this case it equals 2, meaning that the program roughly estimated the size of the repeat to be 66 * 2 = 132 bp. (Though since EHDN identified one in-repeat read, 132 bp is likely an underestimate and the true size of the repeat is at least 150 bp.)
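The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of EHDN: the field names in LocusRow are made up for this example (the thread itself only names het_str_size), and the column order is assumed to match the entry quoted in the question.

```python
from typing import NamedTuple


class LocusRow(NamedTuple):
    # Field names are illustrative; only het_str_size is named in the thread.
    contig: str
    start: int
    end: int
    motif: str
    anchored_irrs: int
    norm_anchored_irrs: float
    het_str_size: int


def parse_locus_row(line: str) -> LocusRow:
    """Split one whitespace-separated row, assuming the 7-column layout quoted above."""
    contig, start, end, motif, irrs, norm, size = line.split()
    return LocusRow(contig, int(start), int(end), motif,
                    int(irrs), float(norm), int(size))


def approx_repeat_bp(row: LocusRow) -> int:
    """Rough repeat size in bp: motif length times het_str_size."""
    return len(row.motif) * row.het_str_size


if __name__ == "__main__":
    # The entry quoted in the question.
    row = parse_locus_row(
        "chr9 27573523 27573524 "
        "CCCCCGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGGCCCCGG "
        "1 0.92 2"
    )
    print(len(row.motif), row.het_str_size, approx_repeat_bp(row))
```

As noted above, the result is a rough estimate and is probably too low whenever an in-repeat read was observed.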

Also note that EHDN determines repeat units from the sequence of the read and, generally, estimation of motifs longer than 15bp can be prone to error.

Did I answer your questions? Please let me know if I didn't explain something well.

sagnikbanerjee15 · 2021-11-18T16:11:12Z

Hello @egor-dolzhenko,

Thank you so much for providing a detailed explanation. Yes, I am looking at the supplementary tsv file created by the "profile" command.

I was actually trying to find a well-known ALS repeat, (GGCCCC)*, located at 9:27573528-27573546 in the C9ORF72 gene. When I ran ExpansionHunter (v5.0.0) on the same data, it reported ~300 copies of the repeat unit, which would correspond to a repeat of at least 1800 bp. But EHDN was not able to find it. Is there any other parameter setting that I could try?

Thank you.
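When comparing motifs between tools, it may help to check whether two repeat units are the same up to rotation or strand, since a tandem repeat can be written starting from any base and from either strand. Below is a minimal sketch of such a check. The helper names are hypothetical and this is not how EHDN or ExpansionHunter compare motifs internally.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def is_rotation(a: str, b: str) -> bool:
    """True if a is a cyclic rotation of b (equal length, a occurs in b+b)."""
    return len(a) == len(b) and a in b + b


def same_repeat_unit(a: str, b: str) -> bool:
    """True if a and b describe the same unit up to rotation or strand."""
    return is_rotation(a, b) or is_rotation(reverse_complement(a), b)


if __name__ == "__main__":
    # GGCCCC is the unit searched for; CCCCGG is the 6 bp unit that
    # recurs inside the 66 bp motif reported in locus.tsv.
    print(same_repeat_unit("GGCCCC", "CCCCGG"))
```

A match here only says the unit strings are equivalent; it does not show that EHDN detected the expansion itself.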

egor-dolzhenko · 2021-11-18T16:35:15Z

@sagnikbanerjee15 Thank you for letting me know. Would it be permissible for you to share parts of EHdn output for this sample? If yes, perhaps we could follow up by email?

sagnikbanerjee15 · 2022-03-29T22:50:55Z

Hello,

Would it be possible to provide some support for this issue? When I last spoke with @egor-dolzhenko, he was going to send me the strawberry program to better locate the expansion sites. He has since moved on, so I was wondering whether someone else could help me with that.

Thanks.
