Context
Historically, metadata for Augur inputs were curated to exclude duplicate strains (e.g., by resolving duplicates in fauna during download from the database) and, as such, read_metadata was designed to throw an error when it found duplicate strains. However, GenBank has supported multiple versions of the same strain's sequence for a long time and GISAID recently added similar support. This means a valid metadata download from either of these sources can produce an error when Augur tries to read these data.
Description
Augur should provide an interface to resolve duplicate strain names in the metadata instead of throwing an error. We should retain the option to throw an error on duplicates, but we should also consider making duplicate resolution the default behavior.
Examples
To test the current issue, create some minimal metadata with a duplicated strain:
cut -f 1,5 data/example_metadata.tsv | head -n 4 | sed 's/VIC1008/VIC1000/' > duplicate_metadata.tsv
Then, try to load the data from a Python terminal:
Possible solution
Ideally, the solution to this issue will not require the user to do anything by default; the solution should allow users with duplicate strains to resolve these duplicates automatically.
To resolve duplicate GISAID or GenBank records, we want to prefer the record with the most recent database accession/id. We currently annotate accessions for GISAID and GenBank in our ncov workflow as gisaid_epi_isl and genbank_accession, respectively. One possible solution could then be:
Check for duplicates
If no duplicates, continue.
If duplicates, check for an Augur config variable (either in global environment variables or a config file) that controls whether we should throw an error on duplicates, and throw an error if so configured.
If duplicates and no error to be thrown, check for one of our predefined accession columns (gisaid_epi_isl and genbank_accession to start).
If an accession column exists, sort records by strain and accession in ascending order and take the last record (or descending/first).
If an accession column does not exist, sort records by strain and take the first record.
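The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not Augur's implementation: `resolve_duplicates` and its `error_on_duplicates` argument are hypothetical names standing in for the proposed interface and config variable, and accessions are compared as plain strings, as the steps describe.

```python
import csv
import io

# Accession columns named in the issue, checked in this order.
ACCESSION_COLUMNS = ("gisaid_epi_isl", "genbank_accession")


def resolve_duplicates(rows, id_column="strain", error_on_duplicates=False):
    """Return one record per strain (hypothetical helper, not part of Augur)."""
    rows = list(rows)

    # Group records by strain, preserving file order within each group.
    groups = {}
    for row in rows:
        groups.setdefault(row[id_column], []).append(row)

    duplicated = sorted(name for name, group in groups.items() if len(group) > 1)
    if not duplicated:
        # No duplicates: continue with the records unchanged.
        return rows
    if error_on_duplicates:
        raise ValueError(f"Duplicate strains found: {', '.join(duplicated)}")

    # Use the first predefined accession column present in the metadata, if any.
    accession_column = next((c for c in ACCESSION_COLUMNS if c in rows[0]), None)

    resolved = []
    for name in sorted(groups):
        group = groups[name]
        if accession_column is None:
            # No accession column: take the first record for the strain.
            resolved.append(group[0])
        else:
            # Sort by accession in ascending order and take the last record.
            resolved.append(sorted(group, key=lambda r: r[accession_column])[-1])
    return resolved


# Made-up example metadata with one duplicated strain.
metadata = io.StringIO(
    "strain\tgisaid_epi_isl\n"
    "A/1\tEPI_ISL_100\n"
    "B/2\tEPI_ISL_200\n"
    "A/1\tEPI_ISL_300\n"
)
records = resolve_duplicates(csv.DictReader(metadata, delimiter="\t"))
```

One caveat with this sketch: string comparison only orders accessions correctly when their numeric parts have the same number of digits, so a real implementation would likely need to parse the numeric portion of each accession before sorting.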
victorlin changed the title from "Provide an interface to resolve duplicate strain names in metadata" to "read_metadata: Provide an interface to resolve duplicate strain names in metadata" on Aug 27, 2024
The root issue here (strain column in metadata having duplicates by design) has since been addressed by the id_columns parameter (initially as valid_index_cols in b78f015) which allows other columns with unique values to be used as the index column.
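To illustrate the idea behind choosing a different index column, here is a minimal stand-alone sketch. It does not call Augur: `pick_id_column` is a hypothetical helper, the default candidates `("strain", "name")` are an assumption, and the "first candidate present wins" rule is this sketch's reading of how an ordered list of ID columns could be applied.

```python
def pick_id_column(rows, id_columns=("strain", "name")):
    """Pick the first candidate column present in the metadata and check
    that its values are unique (hypothetical helper, not Augur's API)."""
    columns = rows[0].keys()
    for candidate in id_columns:
        if candidate in columns:
            values = [row[candidate] for row in rows]
            if len(values) != len(set(values)):
                raise ValueError(f"Duplicate values found in column {candidate!r}")
            return candidate
    raise ValueError(f"None of the columns {id_columns} were found in the metadata")


# Made-up records: the strain is duplicated by design, the accession is unique.
rows = [
    {"strain": "A/1", "genbank_accession": "AB000001.1"},
    {"strain": "A/1", "genbank_accession": "AB000001.2"},
]

# Listing the accession column first avoids the duplicate-strain error.
index_column = pick_id_column(rows, id_columns=("genbank_accession", "strain"))
```

With the default candidates, the same records would raise an error because `strain` is not unique, which is the situation the original issue describes.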