Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patched 0 SNPs -> an old .tbi index file seems to be culprit. #24

Open
mikelove opened this issue Dec 6, 2021 · 3 comments
Open

Patched 0 SNPs -> an old .tbi index file seems to be culprit. #24

mikelove opened this issue Dec 6, 2021 · 3 comments
Assignees

Comments

@mikelove
Copy link

mikelove commented Dec 6, 2021

Similar to #19, but in this case, the same chromosome names are used, so that shouldn't be the issue. I'm using the genome that was used to create the VCF files, as provided by Sanger.

Using g2gtools 0.2.7 installed via conda install -c kbchoi g2gtools=0.2.7=py36_0.

If I download these VCF and FASTA for CAST:

ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

And then start the first two steps to create a transcriptome:

g2gtools vcf2vci -p 12 -i snps/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz \
  -i snps/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz \
  -o snps/CAST_EiJ.vci -s CAST_EiJ -f fasta/GRCm38_68.fa --pass --quality

g2gtools patch -p 12 -i fasta/GRCm38_68.fa -c snps/CAST_EiJ.vci.gz \
  -o fasta/CAST_EiJ.patched.fa 2> fasta/CAST_EiJ.patched.log

The log output says:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 0 SNPs total
[g2gtools] Patch complete: 00:00:14.33

And the output FASTA is equivalent to reference.

If I exclude the indels file, I get:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 552,805 SNPs total
[g2gtools] Patch complete: 00:00:18.53

I've tried re-ordering the SNP and indels file, and I've tried with and without the pass or quality filters. I've also tried merging the two VCF into one file (removing the header of the second one).

@mattjvincent
Copy link
Member

Can we see the output from the patch command with the debug flag turned on?

Try just looking at chromosome 17:

# create just a Fasta file for chromosome 17
g2gtools extract -i fasta/GRCm38_68.fa -id 17 > fasta/GRCm38_68.17.fa

# run patch on just that Fasta file with debug (-d) turned on
g2gtools patch -p 12 -i fasta/GRCm38_68.17.fa -c snps/CAST_EiJ.vci.gz \
  -o fasta/CAST_EiJ.17.patched.fa -d 2> fasta/CAST_EiJ.17.patched.log

I'd like to see the contents of fasta/CAST_EiJ.17.patched.log

@mikelove
Copy link
Author

mikelove commented Dec 6, 2021

While testing chr 17, I think I may have found the issue. I think an out-of-date .tbi file sitting alongside CAST_EiJ.vci.gz is the culprit. Not sure why vcf2vci isn't replace the index. I'm going to test now if adding tabix -p vcf <strain>.vci.gz after vcf2vci and before patch alleviates the issue.

@mikelove
Copy link
Author

mikelove commented Dec 6, 2021

This appears to have been the issue.

Either deleting the stale .tbi before running vcf2vci, or forcing a new index to be created with tabix -p vcf <strain>.vci.gz, now I get non-zero SNPs.

So it appears vcf2vci followed by patch won't overwrite an old .tbi file, although it will overwrite the old vci.gz.

==> fasta/129S1_SvImJ.patched.log <==
[g2gtools] Patched 5,173,412 SNPs total
[g2gtools] Patch complete: 00:00:44.89

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 20,668,274 SNPs total
[g2gtools] Patch complete: 00:00:50.04

@mikelove mikelove changed the title Patched 0 SNPs, this time same chromosome names Patched 0 SNPs -> and old .tbi index file seems to be culprit. Dec 6, 2021
@mikelove mikelove changed the title Patched 0 SNPs -> and old .tbi index file seems to be culprit. Patched 0 SNPs -> an old .tbi index file seems to be culprit. Dec 6, 2021
@mattjvincent mattjvincent self-assigned this Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants