Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicated and missing entries in .methphased.vcf #12

Open
liuy0421 opened this issue Jun 29, 2023 · 6 comments
Open

Duplicated and missing entries in .methphased.vcf #12

liuy0421 opened this issue Jun 29, 2023 · 6 comments

Comments

@liuy0421
Copy link

When I look at the entries in the output .methphased.vcf, there seems to be complete duplicate rows (not just that one variant is methphased into different haplotypes and therefore split into multiple entires with different PS tags, but entirely duplicate entires). This is not a big issue because complete duplicates can be easily dropped - but what could be causing this? Is this intentional?

There also seems to be variants in the original .vcf file that's missing from the output .methphased.vcf, including some that was phased in the original .vcf. Is that intentional?

The original .vcf was filtered to only contain those on autosomes, and the input .bam files are filtered to only contain primary alignments but were not filtered to only keep those that map to autosomes.

Thank you so much!

@Fu-Yilei
Copy link
Collaborator

Should be a bug, could you please show me the vcf file? I have never used MethPhaser on mice samples so a little example could help.

@liuy0421
Copy link
Author

It's not public data - I'll have to ask if it could be shared. I may get back to you on this at a later date. Sorry I can't be more helpful now.

Although one thing I encountered from the current implementation is that, since I filtered the .vcf file for only those on chromosome 1-19 (mouse autosomes), phase_block_df (defined here) defaults the dtype for the chr column to be int instead of str, and the subsequent comparison fails as int(chromosome) == str(chromosome) returns False. As a result all chromosomes were being skipped. Adding dtype={'chr' : 'str' } to the pd.read_csv() call in defining phase_block_df fixed the issue. A similar issue is also in methphasing.

@Fu-Yilei
Copy link
Collaborator

I think that was because most human autosomes can be parsed as int but I can fix that, thanks for pointing out! Btw what is your reference genome headers like? Maybe that was different from human autosomes and could be the issue, not sure.

@liuy0421
Copy link
Author

We use mm39 as reference.

Now I think about it - this line here reads in info from the reference (and the next line gets the chromosome column). If there are non-autosome entries in the reference, there will be non-numerics in their names, and the chromosome column dtype will be str. In this case, if the .vcf file is filtered for autosomes and the chromosome column dtype gets defaulted to int, the int to str comparison fails.

@Fu-Yilei
Copy link
Collaborator

I will download mm39 and take a look, thanks!

@Fu-Yilei
Copy link
Collaborator

Yeah mm39 starts with something like NC_000067.7 and HG38 starts with chr22. I will modify the dataframe reader, thanks for pointing that out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants