Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request - (de)parse subcommand for formatting fasta/metadata #783

Closed
ammaraziz opened this issue Nov 12, 2021 · 5 comments
Closed
Labels
enhancement New feature or request

Comments

@ammaraziz
Copy link

Preamble
Before detailing this feature request, I would like to say that if this request is outside the scope of augur then please feel free to close this issue. I appreciate that helping users to format data is a never ending rabbit hole and the current augur subcommands provide some very impressive and helpful solutions to common problems. At one point a line must be drawn: the user has to format the data correctly and the onus is on us to do so.

Context
Currently when using augur subcommands that require a fasta+meta files, the fasta headers are required to match the meta data exactly (the strain field in the metadata file). This feature request proposes a new deparse subcommand that ingests fasta+meta files, matching by a unique ID then outputting a correctly formatted fasta and metadata.

Description
For many augur subcommands the fasta file headers must match exactly the metadata strain field. This becomes problematic when fasta headers look like this:

hRSV/A/Cote_d'Ivoire/IPCI-021/2019|EPI_ISL_2543811|2019-09-16

But the metadata does not contain a strain field:

Accession ID	Subtype	Collection date	date	Submission date	Continent	Country	City	State
EPI_ISL_1647456	A	2018-02-28	2018-02-28	2021-04-19	Africa 	Mozambique	MaputoCity	Maputo

The above is an output of the gisaid EpiRSV database. There are separate download options for fasta and metadata resulting in the above example. There is no 'download for augur' option. But this particular issue happens frequently, eg when downloading data from viprbrc.

A method for formatting fasta + meta file together would be very handy.

Possible solution
The creation of a new subcommand deparse that operates in a similar manner to parse. While parse ingests 1 fasta file and outputs two correctly formatted fasta+meta files, deparse would ingest a fasta+metafile and output correctly formatted fasta+metafile. To match the metadata, a regex flag could match the fasta header to the metadata. Or the user must specify which field contains the unique-id to match.

The command might look like:

augur deparse \
--sequences input.fasta \
--input-metadata input.tsv \
--unqiue-id "Accession ID" \
--separator "|" \
--fields strain id date other
--id-regex "(EPI_ISL_\d+)"
--output-sequence output.fasta \
--output-metadata output.tsv

Input fasta headers:

hRSV/A/Cote_d'Ivoire/IPCI-021/2019|EPI_ISL_2543811|2019-09-16|FakeData

Input metadata:

Accession ID	Subtype	 date	Submission date	Continent	Country	City	State
EPI_ISL_1647456	A	2018-02-28	2021-04-19	Africa 	Mozambique	MaputoCity	Maputo

Output fasta headers:

A/Cote_d'Ivoire/IPCI-021/2019

Output metadata file:

strain	Accession ID	Subtype	 date	Submission date	Continent	Country	City	State	other
 A/Cote_d'Ivoire/IPCI-021/2019	EPI_ISL_1647456	A	2018-02-28	2021-04-19	Africa 	Mozambique	MaputoCity	Maputo	FakeData

Output metadata now contains the strain and other fields correctly matched by the Accession ID field by the regex (EPI_ISL_\d+) and the fasta file contains only the designation.

Possible issues

  • If fasta id matches >1 row the metadata - error?
  • If the regex matches >1 fasta but 1 unique row - warn?
  • If regex doesn't match any fasta header - error?
  • Parsing fasta headers could return more than the user specified field names - warn?

Thanks

Ammar

@ammaraziz ammaraziz added the enhancement New feature or request label Nov 12, 2021
@ammaraziz
Copy link
Author

This request is heavily related to
nextstrain/ncov#640

@jameshadfield
Copy link
Member

jameshadfield commented Nov 12, 2021

Hi @ammaraziz - I feel the pain, I must have written dozens and dozens of scripts to perform similar transforms. I don't think it's going to be possible to write a one-size-fits-all parser, however one could imagine an augur parse epirsv if there's enough demand. The "possible issues" you list probably need to be handled differently for different use cases. In the meantime, we have helper functions to make these transform scripts easier to write, for instance:

# code not tested - none of your "possible issues" are addressed - TODO ;)
from augur.io import read_metadata, read_sequences, write_sequences

metadata = read_metadata(args.metadata, id_columns=("Accession ID",))
metadata.insert(0, "strain", metadata.index.values) # create 1st column of "strain" which augur expects
metadata["something"] = "" # placeholder for additional info

with open(args.output_sequences, "w") as fh:
    for sequence in read_sequences(args.sequences):
        strain, epi_isl, date, something = sequence.name.split("|")
        ## rename the sequence name
        sequence.name = sequence.id = sequence.description = strain
        metadata.loc[metadata["strain"]==strain, "something"] = something
        # store date similarly, if necessary
        write_sequences(sequence, fh)

metadata.to_csv(args.output_metadata, index=False, sep="\t")

@ammaraziz
Copy link
Author

Hi @jameshadfield

Thanks for getting back to me so quickly. After submitting this issue and reading the ncov build issues/PR, I quickly realised the mammoth task required to achieve a unified parser.

The helper scripts look very promising, are they documented anywhere? I don't see any docs on the nextstrain docs platform.

Regarding augur parse epirsv, I think in the near future this option would be very handy for the RSV community. But what's more useful might be a nextstrain build of rsv similar to the current ncov though I think this would be driven by the RSV community.

@jameshadfield
Copy link
Member

jameshadfield commented Nov 12, 2021

But what's more useful might be a nextstrain build of rsv similar to the current ncov though I think this would be driven by the RSV community.

That would be great! We'd be happy to play a supporting role here. Such a workflow could have a script which does the EpiRSV parsing. Note that we started a basic RSV build however it's not being actively maintained (nCoV is all-consuming), and it also doesn't use EpiRSV as the starting point; it may have some helpful code for you.

The helper scripts look very promising, are they documented anywhere? I don't see any docs on the nextstrain docs platform.

Unfortunately not in this case (see #784, however in general we haven't documented augur functions much, preferring to concentrate our efforts on the command line tools instead).

@ammaraziz
Copy link
Author

Coming back to this a year later (nearly to the day...), the curate subcommand that is being discussed here #860 is a much better approach and I think meets all the needs of this request.

Closing this issue with a thank you James. The code posted above has been helpful for other needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants