-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some Question about EHdn.. #46
Comments
Thanks for the questions Jaehyun!
Best wishes, |
Oh I found out the library prep of our cohort (pcr-amplified) was different from the one of 1KGP (pcr-free). And Hiseq X and Novaseq 6000 were used in our cohort and 1KGP separately. Do you think this makes a difference? I learned Trost et al. (2020) also used pcr-based samples. Are there any methods to correct this batch effect..? Thanks, |
Hi Jaehyun. Apologies for a late reply. Yes, comparing PCR-amplified (PCR+) and PCR-free (PCR-) data can be challenging because of the amplification biases in PCR+ data. I think restricting your analysis to PCR+ data only may be a good option. We will try to implement better amplification bias detection / correction in the future versions of EHdn that might make PCR+ vs PCR- comparisons possible. And yes, we don't officially support using ExpansionHunter with PCR+ data. However, multiple research groups analyzed PCR+ data with ExpansionHunter and obtained good results. (By performing additional analysis of the STR genotypes reported by ExpansionHunter to assess their accuracy.) I think it could still be useful to analyze the genome-wide STR catalog in some of your samples using ExpansionHunter and then inspect the results with REViewer. |
Thanks for your reply! |
Hi Jaehyun, I just wanted to clarify about the use/non-use of samples with PCR+ library prep in Trost et al 2020. We did some initial QC comparing different sequencing platforms and library prep methods (Extended Data Fig. 2a), and based on these results, we excluded PCR+ samples from the main statistical analysis. Cheers, Brett |
Appreciate for your feedback, Brett. Can I ask you one more question? I think checking the size of tandem repeats unique in our ASD cohort is meaningful. So, in addition to a genome-wide repeat catalog you recommended, I have a plan to use a custom catalog based on my EHdn outputs. I'm making a catalog using Tandem Repeat Finder outputs to match loci in EHdn str_profile with reference genome tandem repeat coordinates within 500bp because the coordinates in EHdn, I understand, represent the location of anchored IIRs. Is It okay to make a catalog like this? And it seems that EHdn outputs motifs regardless of the reference genome orientation. Is it right? then how did you address this issue in Trost et al. (2020)? Thanks in advance. |
Hi JaeHyun, I am not completely sure I understand your question, but let me try to answer. Are you wanting to take loci detected by EHdn and genotype them using ExpansionHunter? If so, then using in your variant-catalog file the Tandem Repeats Finder coordinates that correspond to the EHdn coordinates is the right thing to do. Regarding the motif, EHdn reports the canonicalized motif (the lowest-alphabetically representation of the motif under shift and reverse-complement operations), but for the ExpansionHunter variant-catalog, you will need to use the version of the motif that is in reference orientation. I believe that Tandem Repeats Finder always gives the motif in reference orientation, but please double-check this. Please let me know if this answers your question. Cheers, Brett |
Thank you for quick reply! Yes, you got it. But owing to my poor computing skills, I have some difficulty matching with motif under shift or reverse-complement operations.. Please can you share codes or recommend python package about this? It'll really helpful for me. Thanks, |
Hi, Here is the Python code to do that. The ComputeCanonicalRepeatUnit function takes as input the sequence of a motif and returns the canonicalized version of the motif. For example, ComputeCanonicalRepeatUnit("CTG") == "AGC". Full disclosure, to derive this code I merely translated the equivalent routines in Egor's EHdn C++ code to Python. :) Cheers, Brett
|
Thank you so much! It would be a great help. |
Hi, I have one more question about the tandem repeats not matched with TRF coordinates. Here is a plot for the distribution of motif length. I matched loci in EHdn str_profile with reference genome tandem repeat coordinates within 1,000bp. There are many VNTRs(longer than 7bp) not matched with TRF coordinates, and I understood that's possible because of the diversity of VNTR sequence composition. But the thing is the repeats of which motif length is 6bp. The percentage of AACCCT is ,surprisingly, 92% in 6bp motifs. But the percentage of AACCCT is not that high in TRF. I knew that these repeats consist of telomeres. I thought maybe the repeats in EHdn results are false positives. What do you think about it? And how did you estimate the size of the repeats not matched with TRF? Again, thank you for taking the time to help. |
Hi JaeHyun, Sorry for the slow replies. This looks like an interesting observation that would be useful to follow up on. We are currently working on a tool for automated assessment of EHdn results. I think this tool will be the perfect way to assess your results. Would you be interested in testing out the prototype of this tool? If yes, an early prototype should be ready by late November / early December. Best wishes, |
Yes. It would be great if I take part in. I look forward to testing your tool! Thanks, |
Hello!
To detect ASD-related tandem repeat loci in my data, I'm refering to Trost et al. (2020). While I run an outlier analysis in EHdn, I have some questions.
I knew that depth-normalized counts are calculated as the raw count * 40 / the average read depth of the sample and the normalized counts are used to filter tandem repeat regions in Trost et al. (counts ≥ 2). I could understand the normalization of the counts by dividing the avg counts because of comparison between samples, but I had little understanding of the numerator. Is there any reason for setting a specific value (40)?
I got the counts of anchored IIRs in our cohort and an another cohort(1KGP) from EHdn. After filtering out some tandem repeats where too many samples have no count, I ran PCA of the count of anchored IIRs to check any ancestry or batch effect.
Here is the PCA result. It seems that they are split into our cohort(ASD; Korean) and 1KG samples. I checked the versions of gatk, bwa-mem, etc., but they were similar. Is there anything else to consider? Or maybe the count of anchored IIRs is not suitable for PCA?
For example, this image is an IGV screenshot showing one anchored IIR checked.
And here is the EHdn result corresponding to the read above. I think two reads should appear in IGV. I wonder if I misread the result.
Thanks in advance.
The text was updated successfully, but these errors were encountered: