Feature request - (de)parse subcommand for formatting fasta/metadata #783
Comments
This request is heavily related to |
Hi @ammaraziz - I feel the pain; I must have written dozens and dozens of scripts to perform similar transforms. I don't think it's going to be possible to write a one-size-fits-all parser, however one could imagine an
# code not tested - none of your "possible issues" are addressed - TODO ;)
from augur.io import read_metadata, read_sequences, write_sequences
metadata = read_metadata(args.metadata, id_columns=("Accession ID",))
metadata.insert(0, "strain", metadata.index.values) # create 1st column of "strain" which augur expects
metadata["something"] = "" # placeholder for additional info
with open(args.output_sequences, "w") as fh:
    for sequence in read_sequences(args.sequences):
        strain, epi_isl, date, something = sequence.name.split("|")
        ## rename the sequence name
        sequence.name = sequence.id = sequence.description = strain
        metadata.loc[metadata["strain"]==strain, "something"] = something
        # store date similarly, if necessary
        write_sequences(sequence, fh)
metadata.to_csv(args.output_metadata, index=False, sep="\t")
Thanks for getting back to me so quickly. After submitting this issue and reading the ncov build issues/PR, I quickly realised the mammoth task required to achieve a unified parser. The helper scripts look very promising, are they documented anywhere? I don't see any docs on the nextstrain docs platform. Regarding |
That would be great! We'd be happy to play a supporting role here. Such a workflow could have a script which does the EpiRSV parsing. Note that we started a basic RSV build however it's not being actively maintained (nCoV is all-consuming), and it also doesn't use EpiRSV as the starting point; it may have some helpful code for you.
Unfortunately not in this case (see #784). In general, we haven't documented augur functions much, preferring to concentrate our efforts on the command line tools instead.
Coming back to this a year later (nearly to the day...), the Closing this issue with a thank you James. The code posted above has been helpful for other needs. |
Preamble
Before detailing this feature request, I would like to say that if this request is outside the scope of augur then please feel free to close this issue. I appreciate that helping users to format data is a never-ending rabbit hole and the current augur subcommands provide some very impressive and helpful solutions to common problems. At some point a line must be drawn: the user has to format the data correctly and the onus is on us to do so.
Context
Currently, when using augur subcommands that require fasta and metadata files, the fasta headers are required to match the metadata exactly (the strain field in the metadata file). This feature request proposes a new deparse subcommand that ingests a fasta file and a metadata file, matches records by a unique ID, and then outputs a correctly formatted fasta file and metadata file.
Description
For many augur subcommands the fasta file headers must exactly match the metadata strain field. This becomes problematic when fasta headers look like this:
But the metadata does not contain a strain field:
The above is an output of the GISAID EpiRSV database. There are separate download options for fasta and metadata, resulting in the above example. There is no 'download for augur' option. But this particular issue happens frequently, e.g. when downloading data from viprbrc.
A method for formatting fasta + meta file together would be very handy.
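A minimal sketch of what such a formatting step could look like, using only the Python standard library. It is not part of augur: the function name `deparse`, the sample header, and the sample metadata row are all made up for illustration. It assumes each fasta header contains an accession matching the regex `(EPI_ISL_\d+)`, that the metadata is tab-separated with an "Accession ID" column, and that the strain name is the first `|`-separated field of the header.

```python
import csv
import io
import re

# Regex for the unique ID embedded in each fasta header.
ACCESSION_RE = re.compile(r"(EPI_ISL_\d+)")


def deparse(fasta_text, metadata_text, id_column="Accession ID"):
    """Hypothetical helper: rename fasta headers to the strain name and
    return metadata rows with a leading "strain" column."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    rows_by_id = {row[id_column]: row for row in reader}

    fasta_lines = []
    metadata_rows = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            fasta_lines.append(line)  # sequence line, copied unchanged
            continue
        header = line[1:]
        hit = ACCESSION_RE.search(header)
        if hit is None or hit.group(1) not in rows_by_id:
            raise ValueError(f"no metadata match for header: {header}")
        strain = header.split("|")[0]  # assumption: strain is the first field
        metadata_rows.append({"strain": strain, **rows_by_id[hit.group(1)]})
        fasta_lines.append(">" + strain)
    return "\n".join(fasta_lines) + "\n", metadata_rows


# Made-up example input, not real GISAID data.
FASTA = ">hRSV/A/Example/001/2019|EPI_ISL_000001|2019-01-01\nACGT\n"
METADATA = "Accession ID\tCountry\nEPI_ISL_000001\tExampleland\n"

new_fasta, new_rows = deparse(FASTA, METADATA)
print(new_fasta, end="")
print(new_rows)
```

Raising an error on an unmatched header is one design choice among several; a real tool might instead skip such records and report them.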
Possible solution
Create a new subcommand, deparse, that operates in a similar manner to parse. While parse ingests one fasta file and outputs correctly formatted fasta and metadata files, deparse would ingest a fasta file plus a metadata file and output correctly formatted versions of both. To match records, a regex flag could extract the unique ID from each fasta header; alternatively, the user could specify which metadata field contains the unique ID to match.
The command might look like:
Input fasta headers:
Input metadata:
Output fasta headers:
Output metadata file:
Output metadata now contains the strain and other fields, correctly matched on the Accession ID field by the regex (EPI_ISL_\d+), and the fasta file contains only the designation.
Possible issues
Thanks
Ammar