EHDN truncated output #47
Comments
Thanks for the question! The repeat units reported by EHDN always begin with an A or a C. This is because the program does not determine the exact reference orientation of the repeat, so all repeat units that match up to cyclic permutation (rotation) of symbols or reverse complement are treated as equivalent. For example, these repeats are considered equivalent:
So EHDN considers CAG, AGC, GCA, CTG, TGC, and GCT to be equivalent. It picks a single representative repeat unit among these: the smallest unit in lexicographical order (in this case, AGC). This is why all units it reports begin with either A or C. Could you please confirm that you don't have any motifs starting with the letter C in the full output of your program? If there are none, what kind of data are you working with? Do you know if your data is PCR-free or PCR+?
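The rule described above can be sketched in a few lines of Python. This is only an illustration of the idea (take every rotation of the motif and of its reverse complement, then keep the lexicographically smallest string); the function names are hypothetical, and EHDN's actual implementation may differ.

```python
# Illustrative sketch only: names are hypothetical, not taken from EHDN.

def reverse_complement(motif):
    """Return the reverse complement of a DNA motif (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(motif))


def rotations(motif):
    """Return every cyclic rotation of the motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_unit(motif):
    """Pick the lexicographically smallest unit among all rotations of
    the motif and of its reverse complement."""
    motif = motif.upper()
    candidates = rotations(motif) + rotations(reverse_complement(motif))
    return min(candidates)


if __name__ == "__main__":
    for unit in ["CAG", "AGC", "GCA", "CTG", "TGC", "GCT"]:
        print(unit, "->", canonical_unit(unit))
```

Under this sketch, all six units from the example collapse to the same representative, AGC.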
Hi Egor, Thanks so much for your reply. Looking over several motif files, I can see very few motifs starting with C, maybe fewer than 1 in 100. I believe the sequencing data is PCR+; do you think that could be the cause? FYI, I have also looked over some more locus files and they do seem to have some units that start with C, so the lack of C motifs is much more pronounced in the motif files than in the locus files. Thanks
Yes, this makes sense. Motifs starting with a C are 100% GC (if a motif contains an A or a T, then there is an equivalent motif starting with an A, as outlined above). And long, 100% GC repeats (such as those reported in the motif files) may not amplify well in PCR+ data.
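The 100% GC claim can be checked by brute force. The sketch below assumes the representative unit is the lexicographically smallest string among all rotations of a motif and of its reverse complement; the helper names and the length cutoff of 6 are arbitrary choices for illustration, not part of EHDN.

```python
from itertools import product

# Illustrative sketch only: names and the length cutoff are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_unit(motif):
    """Smallest string among rotations of the motif and its reverse complement."""
    rc = motif.translate(COMPLEMENT)[::-1]
    return min(s[i:] + s[:i] for s in (motif, rc) for i in range(len(s)))


def canonical_units(max_len):
    """Collect the representative unit of every motif up to max_len bases."""
    return {
        canonical_unit("".join(bases))
        for n in range(1, max_len + 1)
        for bases in product("ACGT", repeat=n)
    }


units = canonical_units(6)
c_units = [u for u in units if u.startswith("C")]

# Every representative starts with A or C ...
assert all(u[0] in "AC" for u in units)
# ... and those starting with C contain only C and G.
assert all(set(u) <= {"C", "G"} for u in c_units)
```

Both assertions pass for every motif up to the chosen length, which matches the reasoning above: any A lets a rotation start with A, and any T gives an A in the reverse complement.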
Hi, I have been trying to use ExpansionHunter Denovo to analyze some WGS samples in CRAM format, and I am getting motif/locus output files that appear truncated. For instance, both my motif and locus files only contain repeat units that start with the nucleotide 'A'. Is this a problem you have ever encountered before, and do you have any advice for debugging it? I have tried running the program on a more powerful cluster, and I have also experimented with converting the files from CRAM to BAM and trying different MAPQ flag values, but this has had little effect. The program outputs no error messages and appears to run to completion. I have also run the same analysis with the original ExpansionHunter software using your recently released genome-wide variant catalog, which produces more reasonable results and detects many repeats starting with letters other than 'A'.
Below is an example of the output motif file for chr22 from one of my samples (I have changed some of the numbers to protect donor confidentiality). The file has only 24 repeats, all starting with 'A'. Thanks very much for any help you can provide.