vcf2fasta - new options? #72

Open
sprocha opened this issue Dec 5, 2017 · 7 comments

Comments

@sprocha

sprocha commented Dec 5, 2017

Suggestion: it would be very useful to have a tool that can take single-individual VCFs (or VCFs with a variable number of individuals) and produce either a consensus FASTA (using ambiguity codes) or a FASTA with 2 seqs per individual (with the alleles randomised between the two chromosomes).

@uribe-convers
Member

Hey @sprocha,

I have such a script! It will take a VCF file with biallelic data exported using VCF-to-Tab and return a file where the biallelic data has been combined into a single nucleotide using IUPAC ambiguity codes. The code can be found here

Maybe @josephwb can add it to phyx if he thinks it belongs there. Otherwise, you can use it from the link above :)

@josephwb
Member

josephwb commented Dec 5, 2017

That would be a great head start! I had never heard of VCF before writing the existing function...

@sprocha
Author

sprocha commented Dec 5, 2017 via email

@sprocha
Author

sprocha commented Dec 5, 2017 via email

@uribe-convers
Member

@sprocha, yes, this script will generate ambiguity codes only for the variant sites, which are (I think) the only sites you have in a VCF file exported with VCF-to-Tab. If you have the complete sequences in a fasta file, you can use the same script to replace the biallelic sites with an ambiguous nucleotide. The script doesn't care if the sites are SNPs or song lyrics, it's just searching and replacing patterns.
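
To illustrate the search-and-replace idea, here is a minimal Python sketch (not the linked script itself). It assumes each genotype arrives as a two-allele string such as `A/G` or `A|G`, which is my understanding of what VCF-to-Tab emits; the names `IUPAC_CODES`, `iupac_code` and `collapse_genotype` are made up for this example.

```python
# Standard IUPAC codes for the six possible two-base combinations.
IUPAC_CODES = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def iupac_code(allele_a, allele_b):
    """Return one IUPAC symbol for a pair of single-base alleles."""
    a, b = allele_a.upper(), allele_b.upper()
    if a == b:
        # Homozygous site: nothing to collapse.
        return a
    return IUPAC_CODES[frozenset((a, b))]


def collapse_genotype(genotype):
    """Collapse a genotype string like 'A/G' or 'A|G' into one symbol.

    Assumption: exactly two single-base alleles separated by '/' or '|'.
    """
    allele_a, allele_b = genotype.replace("|", "/").split("/")
    return iupac_code(allele_a, allele_b)
```

For example, `collapse_genotype("A/G")` gives `R`, and a homozygous `C/C` stays `C`. Indels and missing genotypes are not handled here.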

Now, if you want to have two sequences, i.e., the alleles, be mindful that you'll need to phase the variant sites!

@josephwb
Member

Hey @sprocha. Sorry this has not been addressed. Hopefully you've been able to accomplish this in some other way.

Do you happen to have:

  1. example input
  2. expected output

That would help on our end. ( -_・)

@sprocha
Author

sprocha commented Apr 23, 2018

Hi!
I managed what I needed, yes. Thanks.
The vcf to consensus fasta (IUPAC) I did using the "bcftools consensus" option here:
https://samtools.github.io/bcftools/bcftools.html

The "2 seqs per individual fasta (but with randomised alleles in each chromosome)" would still be awesome to have (we still have not written the code for that). The thing to take into account when using it in phylogenetic inference is the randomization of REF/ALT states: in a normal vcf the order will always be REF/ALT, so if you take those two states directly you will have one "chromosome" accumulating all the ALT positions, and eventually an "erroneously" long branch. So, randomization would be needed.

I have no examples on hand but intended input is vcf file and intended output a fasta with two seqs per individual (full sequences: non variable - identical to ref - and variable).
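
A rough Python sketch of the randomization described above, in case it helps whoever writes this first. It is only an illustration under several assumptions: the reference sequence is already in memory as a string, the variants are SNPs already parsed into `(position, allele_a, allele_b)` tuples with 0-based positions, and the sites are unphased. The function name `randomized_haplotypes` is hypothetical and not part of phyx or bcftools.

```python
import random


def randomized_haplotypes(reference, variants, seed=None):
    """Build two sequences for one individual from unphased SNP genotypes.

    reference: full reference sequence as a string.
    variants:  iterable of (pos, allele_a, allele_b), pos is 0-based
               (assumed input layout, not a VCF record).
    seed:      optional seed so that a run can be reproduced.
    """
    rng = random.Random(seed)
    hap1 = list(reference)
    hap2 = list(reference)
    for pos, allele_a, allele_b in variants:
        # Flip a coin at every site so that neither sequence
        # systematically collects the ALT states.
        if rng.random() < 0.5:
            allele_a, allele_b = allele_b, allele_a
        hap1[pos] = allele_a
        hap2[pos] = allele_b
    return "".join(hap1), "".join(hap2)
```

Non-variable positions stay identical to the reference in both sequences. Because the assignment is random at each site, the result is not a phased haplotype; it only avoids one sequence accumulating all the ALT positions.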

I will post here if we eventually write this before you ;)
many thanks!

3 participants