EHDN truncated output #47

jonnyelse · 2021-11-10T12:11:55Z

Hi, I have been trying to use expansion hunter denovo to
analysis some WGSs in cram format and I am getting motif/locus output files
that appear truncated. For instance both my motif and locus files only contain
repeat units that start with the nucleotide ‘A’. I am wondering if this is a problem
you have ever encounter before and if you have any advice for debugging this
problem? I have tried running the program on a more powerful cluster and also
experimented with converting the files from cram to bam and trying different MAPQ
flag values but this has had little effect. The program is outputting no error
messages and appears to be running to completion. I have also run the same analysis on the original expansion hunter software using your recently released genome wide variant catalog, which is producing more reasonalbe results and detecting lots of repeats starting with letters other than 'A'.
Below is an example of the
output motif file for chr22 from one of my sample (I have changed some of the
numbers to protect donor confidentiality). The file only has 24 repeats all starting with 'A'. Thanks very much for any help you
can provide.

motif	num_paired_irrs	norm_num_paired_irrs
AAAATACACCACACAC	2	2.11
AAAGGGAGGAGAGAGAG	2	1.11
AAAGGGAGGAGAGAGAGAG	3	3.44
AACCCACTCCCAAGATAACG	1	2.11
AACCCACTCCCAAGATAACT	3	1.22
AACCCT	67	76.77
AACCTCCATAACCTCCCT	14	14.45
AACCTCCCT	9	11.01
AACGTACACACAGGGG	1	1.22
AAGGAGGG	1	1.11
AATGG	10	13.37
AATGGAATGGAATGGAGTGG	2	2.33
ACACACACGCCCCGC	1	1.11
ACAGCGTCCTCACCGCCCCC	2	2.22
ACAGCGTCCTCACCGCCCCG	1	1.11
ACAGCGTCCTCACCGCGTCC	2	2.22
ACAGCGTCCTCACCTCCCCG	2	3.33
ACAGCGTCCTCATCGCGTCC	1	1.11
ACC	1	2.22
AG	2	1.33
AGAGGAGGAGG	1	2.11
AGAGGGATGG	1	1.11
AGGAGGAGGCGGCGG	1	2.22
ATCC	6	8.88

egor-dolzhenko · 2021-11-10T22:28:30Z

Thanks for the question!

The repeat units reported by EHDN always begin with As and Cs. This is because the program does not determine the exact reference orientation of the repeat and all repeat units that are equivalent up to permutation of symbols or reverse complement are considered equivalent. For example, these repeats are considered equivalent:

CAGCAGCAGCAGCAG – repeat unit CAG
AGCAGCAGCAGCAGC – repeat unit AGC
GCAGCAGCAGCAGCA – repeat unit GCA
CTGCTGCTGCTGCTG – repeat unit CTG (reverse complement if CAG)
TGCTGCTGCTGCTGC – repeat unit TGC
GCTGCTGCTGCTGCT – repeat unit GCT

So EHDN considers CAG, AGC, GCA, CTG, TGC, GCT to be equivalent. It picks a single representative repeat unit among all of these: it chooses the smallest unit in lexicographical order (in this case it is AGC). This is why all units it reports begin with either A or C.

Could you please confirm that you don't have any motifs starting with the letter C in the full output of your program? If not, what kind of data are you working with? Do you know if your data is PCR-free or PCR+?

jonnyelse · 2021-11-11T16:18:40Z

Hi Egor, Thanks so much for your reply. Looking over several motif files i can see very few motifs starting with C, maybe less than 1 in 100. I believe the sequences were generated by PCR+, do you think that could be the cause? FYI I have also looked over some more locus files and they do seem to have some that start with C, so I am seeing the lack of C motifs much more in the motif files rather than the locus files. Thanks

egor-dolzhenko · 2021-11-15T20:42:48Z

Yes, this makes sense. Motifs starting with a C are 100% GC (since if motif contains an A or T then there is an equivalent motif starting with an A as outlined above). And long, 100% GC repeats (such as those reported in the motif files) may not amplify well in PCR+ data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EHDN truncated output #47

EHDN truncated output #47

jonnyelse commented Nov 10, 2021

egor-dolzhenko commented Nov 10, 2021

jonnyelse commented Nov 11, 2021

egor-dolzhenko commented Nov 15, 2021

EHDN truncated output #47

EHDN truncated output #47

Comments

jonnyelse commented Nov 10, 2021

egor-dolzhenko commented Nov 10, 2021

jonnyelse commented Nov 11, 2021

egor-dolzhenko commented Nov 15, 2021