Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EHDN truncated output #47

Open
jonnyelse opened this issue Nov 10, 2021 · 3 comments
Open

EHDN truncated output #47

jonnyelse opened this issue Nov 10, 2021 · 3 comments

Comments

@jonnyelse
Copy link

Hi, I have been trying to use expansion hunter denovo to
analysis some WGSs in cram format and I am getting motif/locus output files
that appear truncated. For instance both my motif and locus files only contain
repeat units that start with the nucleotide ‘A’. I am wondering if this is a problem
you have ever encounter before and if you have any advice for debugging this
problem? I have tried running the program on a more powerful cluster and also
experimented with converting the files from cram to bam and trying different MAPQ
flag values but this has had little effect. The program is outputting no error
messages and appears to be running to completion. I have also run the same analysis on the original expansion hunter software using your recently released genome wide variant catalog, which is producing more reasonalbe results and detecting lots of repeats starting with letters other than 'A'.
Below is an example of the
output motif file for chr22 from one of my sample (I have changed some of the
numbers to protect donor confidentiality). The file only has 24 repeats all starting with 'A'. Thanks very much for any help you
can provide.

motif num_paired_irrs norm_num_paired_irrs
AAAATACACCACACAC 2 2.11
AAAGGGAGGAGAGAGAG 2 1.11
AAAGGGAGGAGAGAGAGAG 3 3.44
AACCCACTCCCAAGATAACG 1 2.11
AACCCACTCCCAAGATAACT 3 1.22
AACCCT 67 76.77
AACCTCCATAACCTCCCT 14 14.45
AACCTCCCT 9 11.01
AACGTACACACAGGGG 1 1.22
AAGGAGGG 1 1.11
AATGG 10 13.37
AATGGAATGGAATGGAGTGG 2 2.33
ACACACACGCCCCGC 1 1.11
ACAGCGTCCTCACCGCCCCC 2 2.22
ACAGCGTCCTCACCGCCCCG 1 1.11
ACAGCGTCCTCACCGCGTCC 2 2.22
ACAGCGTCCTCACCTCCCCG 2 3.33
ACAGCGTCCTCATCGCGTCC 1 1.11
ACC 1 2.22
AG 2 1.33
AGAGGAGGAGG 1 2.11
AGAGGGATGG 1 1.11
AGGAGGAGGCGGCGG 1 2.22
ATCC 6 8.88
@egor-dolzhenko
Copy link
Contributor

Thanks for the question!

The repeat units reported by EHDN always begin with As and Cs. This is because the program does not determine the exact reference orientation of the repeat and all repeat units that are equivalent up to permutation of symbols or reverse complement are considered equivalent. For example, these repeats are considered equivalent:

CAGCAGCAGCAGCAG – repeat unit CAG
AGCAGCAGCAGCAGC – repeat unit AGC
GCAGCAGCAGCAGCA – repeat unit GCA
CTGCTGCTGCTGCTG – repeat unit CTG (reverse complement if CAG)
TGCTGCTGCTGCTGC – repeat unit TGC
GCTGCTGCTGCTGCT – repeat unit GCT

So EHDN considers CAG, AGC, GCA, CTG, TGC, GCT to be equivalent. It picks a single representative repeat unit among all of these: it chooses the smallest unit in lexicographical order (in this case it is AGC). This is why all units it reports begin with either A or C.

Could you please confirm that you don't have any motifs starting with the letter C in the full output of your program? If not, what kind of data are you working with? Do you know if your data is PCR-free or PCR+?

@jonnyelse
Copy link
Author

Hi Egor, Thanks so much for your reply. Looking over several motif files i can see very few motifs starting with C, maybe less than 1 in 100. I believe the sequences were generated by PCR+, do you think that could be the cause? FYI I have also looked over some more locus files and they do seem to have some that start with C, so I am seeing the lack of C motifs much more in the motif files rather than the locus files. Thanks

@egor-dolzhenko
Copy link
Contributor

Yes, this makes sense. Motifs starting with a C are 100% GC (since if motif contains an A or T then there is an equivalent motif starting with an A as outlined above). And long, 100% GC repeats (such as those reported in the motif files) may not amplify well in PCR+ data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants