
Interpreting EHDN motif-based analysis results #49

Open
gspirito opened this issue Nov 30, 2021 · 4 comments

Comments

@gspirito

Hi, I wanted to ask some questions about the motif-based outlier analysis.

I have a cohort of 40 individuals (WGS), and I suspect that one of them may have an increased burden of repeat expansions compared to the other samples. Since I am not looking for expansions at specific loci, I ran an outlier motif-based analysis, labeling all samples as "case" in the manifest file.

As a result, one sample has 44 repeat motifs with a Z-score > 3, while all other samples have between 0 and 5 such motifs. Would it make sense to use this result as suggestive evidence of a generally increased burden of repeat expansions in that sample? What would be a suitable Z-score cutoff value?
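The per-sample tally described above can be sketched as follows. The column layout and the example rows are hypothetical; adapt the parsing to the actual fields of your EHdn outlier output.

```python
# Sketch: count, for each sample, how many motifs exceed a Z-score cutoff.
# The (sample, motif, zscore) rows below are made-up stand-in data.
from collections import Counter

CUTOFF = 3.0

rows = [
    ("sample_01", "CAG", 5.2),
    ("sample_01", "GGC", 4.1),
    ("sample_02", "CAG", 1.3),
    ("sample_02", "AT", 3.4),
]

outlier_motifs = Counter()
for sample, motif, zscore in rows:
    if zscore > CUTOFF:
        outlier_motifs[sample] += 1

# Samples with many outlier motifs stand out from the rest of the cohort.
for sample, n in outlier_motifs.most_common():
    print(sample, n)
```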

Thank you in advance.

@mfbennett
Contributor

Hey @gspirito. Thanks for trying EHdn and I hope that you find it useful!

I have been thinking about this question, and your approach of testing for an increased burden of repeat expansions using the motif-based rather than the locus-based analysis sounds reasonable. However, based on what you described, I wouldn't take this as anything more than suggestive evidence. You might also consider trying some other (complementary) approaches, which may provide additional supporting evidence.

One idea would be to run a PCA and see whether this sample is an outlier compared to the rest of your cohort. You could convert the motif normalised paired-IRR counts to a matrix to do this. However, if you go down this route, you may be better served running ExpansionHunter with a genome-wide catalog. (@egor-dolzhenko may have some additional thoughts on this.)

@gspirito
Author

Hi @mfbennett, thank you for the reply. I will try some PCAs with the motif normalised paired-IRR counts.

Regarding the analysis with ExpansionHunter and a catalog I have a few questions:

  • The default catalog has 31 loci (~/ExpansionHunter/variant_catalog/grch38/variant_catalog.json). Is there a way to obtain a larger catalog with many more loci? For example, can I convert this BED file https://s3.amazonaws.com/gangstr/hg38/genomewide/hg38_ver13.bed.gz to .json and use it as the input catalog?

  • ExpansionHunter gives me the number of spanning and in-repeat reads for each locus. Is there also a way to get normalized counts?
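For the first question, a BED-to-catalog conversion could look roughly like this. The assumed column layout (chrom, start, end, motif length, motif) and the coordinate conventions are assumptions here: BED-style and ExpansionHunter coordinates may differ in 0- vs 1-based starts, so verify against both tools' documentation before using anything like this.

```python
# Sketch: convert GangSTR-style BED lines into ExpansionHunter catalog
# entries. Assumed columns: chrom, start, end, motif length, motif.
# Coordinate conventions are NOT checked here -- verify them yourself.
import json

def bed_line_to_entry(line):
    chrom, start, end, _period, motif = line.split()[:5]
    return {
        "LocusId": f"{chrom}_{start}",
        "LocusStructure": f"({motif})*",
        "ReferenceRegion": f"{chrom}:{start}-{end}",
        "VariantType": "Repeat",
    }

bed = ["chr1\t10000\t10012\t3\tCAG"]
catalog = [bed_line_to_entry(line) for line in bed]
print(json.dumps(catalog, indent=2))
```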

Thank you

@egor-dolzhenko
Contributor

Hi @gspirito. You can get a genome-wide STR catalog for ExpansionHunter here: https://github.com/Illumina/RepeatCatalogs/releases/tag/v1.0.0. This catalog contains repeats with similar properties to known pathogenic repeats (polymorphism, complexity of the sequence surrounding the repeat, etc.).

You could normalize the read counts by dividing each count by the locus depth (which ExpansionHunter reports) and then multiplying by the target depth. For example, if the number of in-repeat reads is 20 and the locus depth is 32x, the corresponding count normalized to 40x depth is 20 * 40 / 32 = 25. (Note that this very simplistic normalization procedure is best used when the depths are pretty similar in all the samples.)
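The depth normalisation described above, written out as a small helper:

```python
def normalize_count(count, locus_depth, target_depth=40):
    """Scale a read count from the observed locus depth to a target depth."""
    return count * target_depth / locus_depth

# The worked example from the comment: 20 in-repeat reads at 32x depth,
# normalized to 40x, gives 20 * 40 / 32 = 25.
print(normalize_count(20, 32))
```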

@bucongfan

bucongfan commented Apr 7, 2022

Hi @mfbennett, thank you for your advice.
I followed this suggestion and used the norm_num_paired_irrs values from the 1kGP samples and my own dataset for a PCA to find outliers.
However, I found that the PCA separates the datasets rather than the actual samples. Is it necessary to remove batch effects as part of the PCA?
Since I used norm_num_paired_irrs, do I still need to normalize by coverage depth as @egor-dolzhenko described?
